Abstract | ||
---|---|---|
Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question-answering systems. Recognizing the importance of NER, a plethora of NER techniques for Western and Asian languages have been developed. However, despite having over 490 million Urdu language speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this article makes four key contributions. First, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated NEs. The corpus contains at least twice as many manually tagged NEs as any existing Urdu NER corpus. Second, we have generated six new word embeddings using three different techniques (fastText, Word2vec, and GloVe) on two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language, besides the recently released Urdu word embeddings by Facebook. Third, we have pioneered the application of deep learning techniques, NN and RNN, to Urdu named entity recognition. Finally, we have performed 10 folds of 32 different experiments using combinations of traditional supervised learning and deep learning techniques, seven types of word embeddings, and two different Urdu NER datasets. Based on an analysis of the results, we provide several valuable insights into the effectiveness of deep learning techniques, the impact of word embeddings, and the effect of dataset variations.
|
Year | DOI | Venue |
---|---|---|
2020 | 10.1145/3329710 | ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) |

Keywords | Field | DocType |
---|---|---|
Resource poor languages, Urdu NER corpus, Word2vec, deep learning, fastText, word embeddings | Computer science, Machine translation, Supervised learning, Urdu, Natural language processing, Artificial intelligence, Deep learning, Word2vec, Named-entity recognition | Journal |

Volume | Issue | ISSN |
---|---|---|
19 | 1 | 2375-4699 |

Citations | PageRank | References |
---|---|---|
0 | 0.34 | 8 |
Authors | ||
5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Safia Kanwal | 1 | 0 | 0.34 |
Kamran Malik | 2 | 1 | 1.75 |
Khurram Shahzad | 3 | 165 | 25.77 |
Faisal Aslam | 4 | 0 | 2.03 |
Zubair Nawaz | 5 | 0 | 0.68 |