Language model based Arabic word segmentation - Citegraph

Paper Info

Title
Language model based Arabic word segmentation

Abstract
We approximate Arabic's rich morphology with a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm that builds the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155 million word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance and that the algorithm can be used for many highly inflected languages, provided that one can create a small manually segmented corpus of the language of interest.
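
The core idea in the abstract, choosing the most probable prefix*-stem-suffix* morpheme sequence under a trigram language model, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy prefix and suffix inventories, the romanized example words, the `#` and `+` affix markers, the add-alpha smoothing, and the choice to score each word in isolation.

```python
import math
from collections import Counter

# Hypothetical toy affix inventories in a romanized spelling (not from the paper).
PREFIXES = {"w", "b", "l", "Al"}
SUFFIXES = {"h", "hm", "At", "p"}

def split_suffixes(tail, max_affixes):
    """Return every way to read `tail` as a sequence of known suffixes."""
    if not tail:
        return [[]]
    if max_affixes == 0:
        return []
    out = []
    for s in SUFFIXES:
        if tail.startswith(s):
            for more in split_suffixes(tail[len(s):], max_affixes - 1):
                out.append(["+" + s] + more)
    return out

def candidate_segmentations(word, max_affixes=3):
    """Enumerate all splits of `word` that match prefix*-stem-suffix*."""
    results = []

    def expand(rest, prefixes):
        # Read `rest` as a non-empty stem followed by zero or more suffixes.
        for stem_end in range(1, len(rest) + 1):
            stem, tail = rest[:stem_end], rest[stem_end:]
            for suffixes in split_suffixes(tail, max_affixes):
                results.append(prefixes + [stem] + suffixes)
        # Or strip one more prefix, leaving at least one character for the stem.
        if len(prefixes) < max_affixes:
            for p in PREFIXES:
                if rest.startswith(p) and len(rest) > len(p):
                    expand(rest[len(p):], prefixes + [p + "#"])

    expand(word, [])
    return results

class TrigramLM:
    """Morpheme trigram model with add-alpha smoothing (an assumed smoothing choice)."""

    def __init__(self, segmented_words, alpha=0.1):
        self.alpha = alpha
        self.tri, self.ctx = Counter(), Counter()
        vocab = set()
        for morphemes in segmented_words:
            seq = ["<s>", "<s>"] + morphemes + ["</s>"]
            vocab.update(seq)
            for i in range(2, len(seq)):
                self.tri[tuple(seq[i - 2:i + 1])] += 1
                self.ctx[tuple(seq[i - 2:i])] += 1
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen morphemes

    def logprob(self, morphemes):
        seq = ["<s>", "<s>"] + morphemes + ["</s>"]
        total = 0.0
        for i in range(2, len(seq)):
            num = self.tri[tuple(seq[i - 2:i + 1])] + self.alpha
            den = self.ctx[tuple(seq[i - 2:i])] + self.alpha * self.vocab_size
            total += math.log(num / den)
        return total

def segment(word, lm):
    """Pick the candidate segmentation with the highest trigram log-probability."""
    return max(candidate_segmentations(word), key=lm.logprob)

# A tiny hand-segmented seed corpus (hypothetical data, one word per entry).
SEED = [
    ["w#", "Al#", "ktAb"],
    ["Al#", "ktAb"],
    ["ktAb", "+h"],
    ["w#", "ktAb", "+hm"],
    ["b#", "Al#", "qlm"],
]
lm = TrigramLM(SEED)
print(segment("wAlktAb", lm))
print(segment("ktAbhm", lm))
```

In this sketch any substring may serve as a stem, so unseen stems are handled only through smoothing. The stem acquisition step described in the abstract would instead extend the vocabulary from the unsegmented corpus and re-estimate the model, which this example does not attempt.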

Year	DOI	Venue
2003	10.3115/1075096.1075147	ACL
Keywords	Field	DocType
unsegmented corpus,large unsegmented arabic corpus,training corpus,segmented arabic corpus,language model,unsupervised algorithm,arabic word segmenter,test corpus,approximate arabic,segmented corpus,arabic word segmentation system,arabic word segmentation,word segmentation	Morpheme,Word lists by frequency,Arabic,Segmentation,Computer science,Speech recognition,Prefix,Text segmentation,Natural language processing,Artificial intelligence,Vocabulary,Language model	Conference
Volume	Citations	PageRank
P03-1	64	6.37
References	Authors
8	5

Authors (5 rows)

Name	Order	Citations	PageRank
Young-Suk Lee	1	264	25.78
K. Papineni	2	4902	323.22
Salim Roukos	3	6248	845.50
Ossama Emam	4	103	10.74
Hany Hassan	5	277	26.16
