Title
Exploiting Separation of Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging
Abstract
Research on the problem of morphological disambiguation of Arabic has noted that techniques developed for lexical disambiguation in English do not easily transfer over, since the affixation present in Arabic creates a very different tag set than for English, encoding both inflectional morphology and more complex tokenization sequences. This work takes a new approach to this problem based on a distinction between the open-class and closed-class categories of tokens, which differ both in their frequencies and in their possible morphological affixations. This separation simplifies the morphological analysis problem considerably, making it possible to use a Conditional Random Field model for joint tokenization and “core” part-of-speech tagging of the open-class items, while the closed-class items are handled by regular expressions. This work is therefore situated between data-driven approaches and those that use a morphological analyzer. For the tasks of tokenization and core part-of-speech tagging, the resulting system outperforms, on the given test set, a system that incorporates a morphological analyzer. We also evaluate the effects of the differences on parser performance when the tagger output is used for parser input.
Year
2011
DOI
10.1145/1929908.1929912
Venue
ACM Trans. Asian Lang. Inf. Process.
Keywords
arabic, arabic tokenization, morphological analyzer, complex tokenization sequence, closed-class item, possible morphological affixations, core part-of-speech tagging, morphological analysis, lexical disambiguation, closed-class categories, exploiting separation, joint tokenization, morphological disambiguation, morphological analysis problem, part-of-speech tagging, closed-class category, regular expression, conditional random field
Field
Computer science, Lexical analysis, Artificial intelligence, Natural language processing, Morphological analysis, Conditional random field, Tokenization (data security), Regular expression, Pattern recognition, Speech recognition, Parsing, Encoding (memory), Test set
DocType
Journal
Volume
10
Issue
1
Citations
1
PageRank
0.35
References
9
Authors
1
Name
Seth Kulick
Order
1
Citations
221
PageRank
29.66