Title | ||
---|---|---|
A Dictionary Based Urdu Word Segmentation Using Maximum Matching Algorithm for Space Omission Problem |
Abstract | ||
---|---|---|
Word segmentation is the first step in any Natural Language Processing system. It means dividing a sentence into its constituent words. Urdu was selected for this research because very little work has been done on it. In Urdu, spaces cannot be relied on to mark word boundaries because they are not used consistently. Urdu word segmentation differs from that of other Asian languages in that it involves both the Space Omission and the Space Insertion problems. This paper discusses these problems and proposes a technique that addresses both. It applies simple, established techniques in a different way to develop an efficient segmentation algorithm. Morphological analysis of Urdu text is also taken into account, and a dictionary is used to identify and verify Urdu words. The work was tested on words collected from the Geo, Jang, and BBC news sites and other documents available online. The proposed algorithm was tested on 11,995 words, of which 97.2% were segmented correctly. |
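
The abstract names dictionary-based maximum matching as the core technique for the space omission problem but gives no procedure. As a rough illustration only, the sketch below shows generic forward maximum matching: at each position it takes the longest dictionary word that fits, and falls back to a single character when nothing matches. It is not the authors' algorithm; the function name `max_match`, the single-character fallback, and the toy Latin-script lexicon (a stand-in for an Urdu dictionary) are all assumptions made for the example, and the morphological analysis mentioned in the abstract is not modelled.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching over a string written without spaces.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            # Assumed fallback: no dictionary word starts here, so emit one character.
            j = i + 1
        words.append(text[i:j])
        i = j
    return words


# Toy lexicon standing in for an Urdu dictionary (assumption for the example).
lexicon = {"word", "segment", "segmentation", "is", "hard"}
print(max_match("wordsegmentationishard", lexicon))
# → ['word', 'segmentation', 'is', 'hard']
```

Because the longest match wins, "segmentation" is chosen over the shorter entry "segment". The same greediness can mis-segment ambiguous strings, which is one reason a practical system would need further checks beyond a plain dictionary lookup.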
Year | DOI | Venue |
---|---|---|
2012 | 10.1109/IALP.2012.11 | IALP |
Keywords | Field | DocType |
research purpose urdu,efficient segmentation algorithm,space omission,word segmentation,space omission problem,urdu words,word boundary,urdu text,urdu word segmentation,maximum matching algorithm,space insertion problem,urdu space,pattern matching,internet,electronic publishing,natural language processing,dictionaries,text analysis | Computer science,Artificial intelligence,Natural language processing,Word processing,The Internet,Segmentation,Algorithm,Matching (graph theory),Text segmentation,Speech recognition,Urdu,Pattern matching,Sentence | Conference |
ISSN | Citations | PageRank |
2159-1962 | 1 | 0.36 |
References | Authors | |
1 | 2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Rabiya Rashid | 1 | 1 | 0.36 |
Seemab Latif | 2 | 27 | 5.71 |