Title
The Impact of Weighting Schemes and Stemming Process on Topic Modeling of Arabic Long and Short Texts
Abstract
In this article, we first present a comprehensive study of the impact of term weighting schemes on topic modeling performance (i.e., LDA and DMM) on Arabic long and short texts. We investigate six term weighting methods: word count (standard topic models), TF-IDF, PMI, BDC, CLPB, and CEW. Moreover, we propose a novel combination term weighting scheme, CmTLB. It builds on mTFIDF, which takes into account missing terms and the number of documents in which a term appears when calculating the term weight. To obtain more robust term weights, we combine mTFIDF with two other weighting methods. We evaluate CmTLB against the studied weighting schemes through the quality of the learned topics (topic visualization and topic coherence) as well as classification and clustering tasks. We apply the weighting schemes to Latent Dirichlet Allocation (LDA) and the Dirichlet Multinomial Mixture (DMM) model on eight Arabic long- and short-document datasets, respectively. The experimental results show that appropriate weighting schemes can effectively improve topic modeling performance on Arabic texts. More importantly, our proposed CmTLB significantly outperforms the other weighting schemes. Second, we investigate whether the Arabic stemming process can improve topic modeling performance. We study the three approaches to Arabic stemming: root-based, stem-based, and statistical. We also train topic models with weighting schemes on documents after applying four stemmers representing these different approaches. The results show that applying the stemming process not only reduces the dimensionality of the term-document matrix, leading to a faster estimation process, but also improves topic modeling performance on both short and long Arabic documents. Moreover, the Farasa stemmer achieves the highest performance in most cases, since it avoids the ambiguity that can arise from the blind removal of affixes in root-based or stem-based stemmers.
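The abstract does not give the paper's mTFIDF or CmTLB formulas, so they are not reproduced here. As a point of reference for the weighting schemes compared, standard TF-IDF (one of the studied baselines) can be sketched as follows; the tokenized example documents and the smoothed-IDF variant are illustrative assumptions, not the paper's exact setup.

```python
import math

def tfidf_weights(docs):
    """Compute smoothed TF-IDF weights for each term of each tokenized document.

    tf  = term count / document length
    idf = log((1 + N) / (1 + df)), where df is the number of documents
          containing the term (smoothing avoids division by zero).
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weights.append({
            term: (count / len(doc)) * math.log((1 + n) / (1 + df[term]))
            for term, count in tf.items()
        })
    return weights

# Toy corpus (illustrative only)
docs = [["topic", "model", "arabic"], ["arabic", "stemmer"], ["topic", "coherence"]]
w = tfidf_weights(docs)
# "arabic" occurs in 2 of 3 documents, so within the second document it
# receives a lower weight than the rarer term "stemmer".
```

Under such a scheme, a term's weight then rescales its contribution when estimating the topic-term distributions, which is how weighting schemes plug into LDA and DMM in this line of work.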
Year
2020
DOI
10.1145/3405843
Venue
ACM Transactions on Asian and Low-Resource Language Information Processing
Keywords
Latent Dirichlet allocation (LDA), Dirichlet multinomial mixture (DMM), natural language processing (NLP), term weighting schemes, stemming process, Arabic text
DocType
Journal
Volume
19
Issue
6
ISSN
2375-4699
Citations
1
PageRank
0.37
References
0
Authors
5
Name                   Order  Citations  PageRank
Tinghuai Ma            1      107        11.50
Raeed Alsabri          2      1          0.70
Lejun Zhang            3      78         15.62
Bockarie Daniel Marah  4      1          0.37
Najla Al-Nabhan        5      19         6.49