Title
The Impact of Weighting Schemes and Stemming Process on Topic Modeling of Arabic Long and Short Texts
Abstract
In this article, we first present a comprehensive study of the impact of term weighting schemes on topic modeling performance (i.e., LDA and DMM) on Arabic long and short texts. We investigate six term weighting methods: word count (standard topic models), TF-IDF, PMI, BDC, CLPB, and CEW. Moreover, we propose a novel combination term weighting scheme, CmTLB. It builds on mTFIDF, which takes into account missing terms and the number of documents in which a term appears when calculating the term weight. To obtain more robust term weights, we combine mTFIDF with two other weighting methods. We evaluate CmTLB against the studied weighting schemes through the quality of the learned topics (topic visualization and topic coherence) as well as classification and clustering tasks. We apply the weighting schemes to Latent Dirichlet Allocation (LDA) and the Dirichlet Multinomial Mixture (DMM) model on eight Arabic long- and short-document datasets, respectively. The experimental results show that appropriate weighting schemes can effectively improve topic modeling performance on Arabic texts. More importantly, our proposed CmTLB significantly outperforms the other weighting schemes. Second, we investigate whether the Arabic stemming process can improve topic modeling performance. We study the three approaches to Arabic stemming: root-based, stem-based, and statistical. We also train topic models with weighting schemes on documents after applying four stemmers representing these different approaches. The results show that applying the stemming process not only reduces the dimensionality of the term-document matrix, leading to a faster estimation process, but also improves topic modeling performance on both short and long Arabic documents. Moreover, the Farasa stemmer achieves the highest performance in most cases, since it avoids the ambiguity that can arise from the blind removal of affixes in root-based or stem-based stemmers.
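The abstract does not give the paper's mTFIDF or CmTLB formulas, so they are not reproduced here. As a point of reference for the weighting schemes compared, standard TF-IDF (one of the studied baselines) can be sketched as follows; the tokenized example documents and the smoothed-IDF variant are illustrative assumptions, not the paper's exact setup.

```python
import math

def tfidf_weights(docs):
    """Compute smoothed TF-IDF weights for each term of each tokenized document.

    tf  = term count / document length
    idf = log((1 + N) / (1 + df)), where df is the number of documents
          containing the term (smoothing avoids division by zero).
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        weights.append({
            term: (count / len(doc)) * math.log((1 + n) / (1 + df[term]))
            for term, count in tf.items()
        })
    return weights

# Toy corpus (illustrative only)
docs = [["topic", "model", "arabic"], ["arabic", "stemmer"], ["topic", "coherence"]]
w = tfidf_weights(docs)
# "arabic" occurs in 2 of 3 documents, so within the second document it
# receives a lower weight than the rarer term "stemmer".
```

Under such a scheme, a term's weight then rescales its contribution when estimating the topic-term distributions, which is how weighting schemes plug into LDA and DMM in this line of work.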
Year
2020
DOI
10.1145/3405843
Venue
ACM Transactions on Asian and Low-Resource Language Information Processing
Keywords
Latent Dirichlet allocation (LDA), Dirichlet multinomial mixture (DMM), natural language processing (NLP), term weighting schemes, stemming process, Arabic text
DocType
Journal
Volume
19
Issue
6
ISSN
2375-4699
Citations
1
PageRank
0.37
References
0
Authors
5
Name                   Order  Citations  PageRank
Tinghuai Ma            1      107        11.50
Raeed Alsabri          2      1          0.70
Lejun Zhang            3      78         15.62
Bockarie Daniel Marah  4      1          0.37
Najla Al-Nabhan        5      19         6.49