Title
Language Model Based Arabic Word Segmentation
Abstract
We approximate Arabic's rich morphology with a model in which a word consists of a sequence of morphemes in the pattern prefix*-stem-suffix* (* denotes zero or more occurrences of a morpheme). Our method is seeded with a small manually segmented Arabic corpus, which bootstraps an unsupervised algorithm that builds the Arabic word segmenter from a large unsegmented Arabic corpus. The algorithm uses a trigram language model to determine the most probable morpheme sequence for a given input. The language model is initially estimated from the small manually segmented corpus of about 110,000 words. To improve segmentation accuracy, we use an unsupervised algorithm to automatically acquire new stems from a 155-million-word unsegmented corpus and re-estimate the model parameters with the expanded vocabulary and training corpus. The resulting Arabic word segmentation system achieves around 97% exact match accuracy on a test corpus containing 28,449 word tokens. We believe this is state-of-the-art performance and that the algorithm can be used for many highly inflected languages, provided that one can create a small manually segmented corpus of the language of interest.
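The decoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the affix inventories, the four-entry training corpus, the `#` and `+` affix markers, the additive smoothing, and all function and class names below are assumptions chosen for illustration. The sketch also scores each word in isolation and omits the unsupervised stem-acquisition step. Words are written in Buckwalter-style transliteration.

```python
import math
from collections import Counter

# Toy affix inventories (assumptions for illustration, not the paper's tables).
PREFIXES = ["Al", "b", "l", "w"]
SUFFIXES = ["At", "h", "hm"]


def candidate_segmentations(word, max_affixes=2):
    """Enumerate every prefix*-stem-suffix* split of `word` with a non-empty stem."""
    def strip_prefixes(rest, taken):
        yield taken, rest
        if len(taken) < max_affixes:
            for p in PREFIXES:
                if rest.startswith(p) and len(rest) > len(p):
                    yield from strip_prefixes(rest[len(p):], taken + [p])

    def strip_suffixes(rest, taken):
        yield rest, taken
        if len(taken) < max_affixes:
            for s in SUFFIXES:
                if rest.endswith(s) and len(rest) > len(s):
                    yield from strip_suffixes(rest[:-len(s)], [s] + taken)

    results = []
    for prefixes, rest in strip_prefixes(word, []):
        for stem, suffixes in strip_suffixes(rest, []):
            # Mark prefixes with a trailing '#' and suffixes with a leading '+'.
            results.append(tuple([p + "#" for p in prefixes] + [stem]
                                 + ["+" + s for s in suffixes]))
    return results


class TrigramLM:
    """Morpheme trigram model with additive smoothing (a simplifying assumption)."""

    def __init__(self, segmented_corpus, alpha=0.1):
        self.alpha = alpha
        self.trigrams = Counter()
        self.contexts = Counter()
        vocab = set()
        for morphemes in segmented_corpus:
            seq = ["<s>", "<s>"] + list(morphemes) + ["</s>"]
            vocab.update(seq)
            for i in range(2, len(seq)):
                self.trigrams[tuple(seq[i - 2:i + 1])] += 1
                self.contexts[tuple(seq[i - 2:i])] += 1
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen morphemes

    def logprob(self, morphemes):
        seq = ["<s>", "<s>"] + list(morphemes) + ["</s>"]
        total = 0.0
        for i in range(2, len(seq)):
            num = self.trigrams[tuple(seq[i - 2:i + 1])] + self.alpha
            den = self.contexts[tuple(seq[i - 2:i])] + self.alpha * self.vocab_size
            total += math.log(num / den)
        return total


def segment(word, lm):
    """Return the candidate segmentation with the highest trigram log-probability."""
    return max(candidate_segmentations(word), key=lm.logprob)


# Tiny hand-made "manually segmented corpus" (hypothetical data).
SEED_CORPUS = [
    ("w#", "Al#", "ktAb"),
    ("Al#", "ktAb"),
    ("ktAb", "+hm"),
    ("w#", "ktb"),
]

if __name__ == "__main__":
    lm = TrigramLM(SEED_CORPUS)
    print(segment("wAlktAb", lm))
    print(segment("ktAbhm", lm))
```

With this toy seed corpus, `segment("wAlktAb", lm)` selects the split into `w#`, `Al#`, and `ktAb`, because those trigrams were observed in the seed data while the unsegmented alternatives fall back to smoothed counts. A real system would estimate the model from a far larger segmented corpus and would need a more careful smoothing scheme.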
Year: 2003
DOI: 10.3115/1075096.1075147
Venue: ACL
Keywords: unsegmented corpus, large unsegmented Arabic corpus, training corpus, segmented Arabic corpus, language model, unsupervised algorithm, Arabic word segmenter, test corpus, approximate Arabic, segmented corpus, Arabic word segmentation system, Arabic word segmentation, word segmentation
Field: Morpheme, Word lists by frequency, Arabic, Segmentation, Computer science, Speech recognition, Prefix, Text segmentation, Natural language processing, Artificial intelligence, Vocabulary, Language model
DocType: Conference
Volume: P03-1
Citations: 64
PageRank: 6.37
References: 8
Authors: 5
Name            Order  Citations  PageRank
Young-Suk Lee   1      264        25.78
K. Papineni     2      4902       323.22
Salim Roukos    3      6248       845.50
Ossama Emam     4      103        10.74
Hany Hassan     5      277        26.16