Title
A Dictionary Based Urdu Word Segmentation Using Maximum Matching Algorithm for Space Omission Problem
Abstract
Word segmentation, the task of dividing a sentence into its constituent words, is the first step in any Natural Language Processing system. This research targets Urdu because very little work has been done on it. In Urdu, spaces cannot be relied on to mark word boundaries because they are not used consistently. Urdu word segmentation differs from that of other Asian languages in that it involves both the Space Omission and the Space Insertion problem. This paper discusses these problems and proposes a technique that solves both of them. It combines simple, well-established techniques in a new way to develop an efficient segmentation algorithm. Morphological analysis of Urdu text is also taken into account, and a dictionary is used to identify and verify Urdu words. The algorithm was tested on 11,995 words collected from the Geo, Jang, and BBC news sites and other documents available online; 97.2% of these words were segmented correctly.
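The abstract names dictionary-based maximum matching as the core technique for the space omission problem. The following is a minimal sketch of generic forward maximum matching, not the authors' implementation: the function name `max_match`, the `max_len` parameter, and the single-character fallback for unknown text are assumptions made for illustration, and the Latin-script toy dictionary merely stands in for an Urdu lexicon. The paper's morphological analysis and space insertion handling are not modelled here.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching over an unspaced string.

    Hypothetical helper for illustration: at each position, take the
    longest dictionary word that starts there; if none matches, emit a
    single character as an unknown token and move on.
    """
    if max_len is None:
        # Longest dictionary entry bounds the candidate window.
        max_len = max((len(w) for w in dictionary), default=1)

    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here (assumed fallback policy).
            words.append(text[i])
            i += 1
    return words


# Toy lexicon standing in for an Urdu dictionary.
toy_dictionary = {"the", "them", "me", "men", "table"}
segmented = max_match("thetable", toy_dictionary)
```

Because the matcher is greedy, it can commit to a long word that leaves an unsegmentable remainder (for example, `"themen"` is split after `"them"` rather than as `"the"` + `"men"`), which is one reason a practical system would pair it with further checks such as the morphological analysis the abstract mentions.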
Year
2012
DOI
10.1109/IALP.2012.11
Venue
IALP
Keywords
research purpose urdu, efficient segmentation algorithm, space omission, word segmentation, space omission problem, urdu words, word boundary, urdu text, urdu word segmentation, maximum matching algorithm, space insertion problem, urdu space, pattern matching, internet, electronic publishing, natural language processing, dictionaries, text analysis
Field
Computer science, Artificial intelligence, Natural language processing, Word processing, The Internet, Segmentation, Algorithm, Matching (graph theory), Text segmentation, Speech recognition, Urdu, Pattern matching, Sentence
DocType
Conference
ISSN
2159-1962
Citations
1
PageRank
0.36
References
1
Authors
2
Name             Order   Citations   PageRank
Rabiya Rashid    1       1           0.36
Seemab Latif     2       27          5.71