Abstract |
---|
We approximate Arabic's rich morphology with a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded by a small manually segmented Arabic corpus and uses it to bootstrap an unsupervised algorithm that builds the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from a small manually segmented corpus of about 110,000 words. To improve the segmentation accuracy, we use an unsupervised algorithm for automatically acquiring new stems from a 155-million-word unsegmented corpus, and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance and that the algorithm can be used for many highly inflected languages, provided that one can create a small manually segmented corpus of the language of interest. |
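
The decoding step described in the abstract can be illustrated with a minimal sketch: enumerate every prefix*-stem-suffix* split of a word, then keep the split that a trigram model over morphemes scores highest. Everything concrete here is an assumption made for illustration and is not taken from the paper: the affix inventories, the transliterated toy training data, the `#`/`+` affix markers, the add-k smoothing constants `ALPHA` and `VOCAB_SIZE`, and the choice to score each word in isolation. The unsupervised stem-acquisition step is not covered.

```python
import math
from collections import Counter

# Hypothetical affix inventories (transliterated), for illustration only.
PREFIXES = {"w", "b", "Al", "l"}
SUFFIXES = {"h", "hA", "At", "wn"}
ALPHA = 0.01        # assumed add-k smoothing constant
VOCAB_SIZE = 1000   # assumed size of the open morpheme vocabulary

def candidate_segmentations(word):
    """Yield every split of `word` that matches prefix*-stem-suffix*."""
    def prefix_splits(s):
        yield [], s
        for p in PREFIXES:
            if s.startswith(p) and len(s) > len(p):
                for more, rest in prefix_splits(s[len(p):]):
                    yield [p] + more, rest

    def suffix_splits(s):
        yield s, []
        for x in SUFFIXES:
            if s.endswith(x) and len(s) > len(x):
                for rest, more in suffix_splits(s[:-len(x)]):
                    yield rest, more + [x]

    for prefixes, rest in prefix_splits(word):
        for stem, suffixes in suffix_splits(rest):
            # Mark prefixes with a trailing '#' and suffixes with a leading '+'.
            yield [p + "#" for p in prefixes] + [stem] + ["+" + x for x in suffixes]

class TrigramModel:
    """Add-k smoothed trigram model over morpheme sequences."""
    def __init__(self, segmented_words):
        self.tri, self.bi = Counter(), Counter()
        for morphemes in segmented_words:
            seq = ["<s>", "<s>"] + morphemes + ["</s>"]
            for i in range(2, len(seq)):
                self.tri[tuple(seq[i - 2:i + 1])] += 1
                self.bi[tuple(seq[i - 2:i])] += 1

    def logprob(self, morphemes):
        seq = ["<s>", "<s>"] + morphemes + ["</s>"]
        total = 0.0
        for i in range(2, len(seq)):
            num = self.tri[tuple(seq[i - 2:i + 1])] + ALPHA
            den = self.bi[tuple(seq[i - 2:i])] + ALPHA * VOCAB_SIZE
            total += math.log(num / den)
        return total

def segment(word, model):
    """Return the highest-scoring prefix*-stem-suffix* split of `word`."""
    return max(candidate_segmentations(word), key=model.logprob)

# Toy stand-in for the small manually segmented seed corpus.
TRAIN = [
    ["w#", "ktAb", "+h"], ["Al#", "ktAb"], ["ktAb", "+hA"],
    ["w#", "Al#", "byt"], ["byt", "+h"], ["ktAb"],
]
model = TrigramModel(TRAIN)
print(segment("wktAbh", model))  # → ['w#', 'ktAb', '+h']
```

In this toy setup the smoothing constants matter: with plain add-one smoothing over such a tiny vocabulary, shorter unsegmented candidates can outscore the intended split, which is why a small `ALPHA` and a larger assumed vocabulary are used.
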
Year | DOI | Venue |
---|---|---|
2003 | 10.3115/1075096.1075147 | ACL |
Keywords | Field | DocType |
---|---|---|
unsegmented corpus,large unsegmented arabic corpus,training corpus,segmented arabic corpus,language model,unsupervised algorithm,arabic word segmenter,test corpus,approximate arabic,segmented corpus,arabic word segmentation system,arabic word segmentation,word segmentation | Morpheme,Word lists by frequency,Arabic,Segmentation,Computer science,Speech recognition,Prefix,Text segmentation,Natural language processing,Artificial intelligence,Vocabulary,Language model | Conference |
Volume | Citations | PageRank |
---|---|---|
P03-1 | 64 | 6.37 |
References | Authors |
---|---|
8 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Young-Suk Lee | 1 | 264 | 25.78 |
K. Papineni | 2 | 4902 | 323.22 |
Salim Roukos | 3 | 6248 | 845.50 |
Ossama Emam | 4 | 103 | 10.74 |
Hany Hassan | 5 | 277 | 26.16 |