Exploiting poly-lingual documents for improving text categorization effectiveness - Citegraph

Paper Info

Title
Exploiting poly-lingual documents for improving text categorization effectiveness

Abstract
With the globalization of business environments and rapid emergence and proliferation of the Internet, organizations or individuals often generate, acquire, and then archive documents written in different languages (i.e., poly-lingual documents). Prevalent document management practice is to use categories to organize this ever-increasing volume of poly-lingual documents for subsequent searches and accesses. Poly-lingual text categorization (PLTC) refers to the automatic learning of text categorization models from a set of preclassified training documents written in different languages and the subsequent assignment of unclassified poly-lingual documents to predefined categories on the basis of the induced text categorization models. Although PLTC can be approached as multiple, independent monolingual text categorization problems, this naive PLTC approach employs only the training documents of the same language to construct a monolingual classifier and thus fails to exploit the opportunity offered by poly-lingual training documents. In this study, we propose a feature-reinforcement-based PLTC (FR-PLTC) technique that takes into account the training documents of all languages when constructing a monolingual classifier for a specific language. Using the independent monolingual text categorization (MnTC) approach as a performance benchmark, the empirical evaluation results show that our proposed FR-PLTC technique achieves higher classification accuracy than the benchmark technique. In addition, our empirical results suggest the superiority of the proposed FR-PLTC technique over its counterpart across a range of training sizes.
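
As a rough illustration of the independent monolingual text categorization (MnTC) benchmark described in the abstract, the sketch below trains one classifier per language, each using only the training documents of that language. The abstract does not name a learning algorithm, so the multinomial Naive Bayes learner, the function names (`train_nb`, `classify_nb`, `train_mntc`, `classify_mntc`), and the toy documents are all assumptions made for this example. It does not implement the feature-reinforcement step of FR-PLTC, which the abstract does not detail.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Train a multinomial Naive Bayes model (an assumed learner).

    docs: list of (tokens, category) pairs, all in one language.
    """
    cat_counts = Counter()              # number of documents per category
    term_counts = defaultdict(Counter)  # term frequencies per category
    vocab = set()
    for tokens, cat in docs:
        cat_counts[cat] += 1
        term_counts[cat].update(tokens)
        vocab.update(tokens)
    return cat_counts, term_counts, vocab


def classify_nb(model, tokens):
    """Return the category with the highest log-posterior score."""
    cat_counts, term_counts, vocab = model
    total_docs = sum(cat_counts.values())
    best_cat, best_score = None, -math.inf
    for cat, n_docs in cat_counts.items():
        score = math.log(n_docs / total_docs)
        # Laplace (add-one) smoothing over the language's vocabulary.
        denom = sum(term_counts[cat].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                score += math.log((term_counts[cat][t] + 1) / denom)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


def train_mntc(training_docs):
    """Build one independent classifier per language.

    training_docs: list of (language, tokens, category) triples.
    Each classifier sees only the documents written in its own language.
    """
    by_lang = defaultdict(list)
    for lang, tokens, cat in training_docs:
        by_lang[lang].append((tokens, cat))
    return {lang: train_nb(docs) for lang, docs in by_lang.items()}


def classify_mntc(models, lang, tokens):
    """Route an unclassified document to the classifier for its language."""
    return classify_nb(models[lang], tokens)


# Hypothetical toy corpus: two languages, two categories.
training = [
    ("en", ["stock", "market", "price"], "finance"),
    ("en", ["football", "match", "goal"], "sports"),
    ("zh", ["股票", "市场"], "finance"),
    ("zh", ["足球", "比赛"], "sports"),
]
models = train_mntc(training)
```

In this baseline the English classifier never sees the Chinese training documents and vice versa, which is the limitation the abstract says FR-PLTC is meant to address.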

Year	2014
DOI	10.1016/j.dss.2013.08.001
Venue	Decision Support Systems
Volume	57
ISSN	0167-9236
DocType	Journal
Keywords	poly-lingual text categorization, independent monolingual text categorization, proposed FR-PLTC technique, training document, text categorization effectiveness, poly-lingual training document, different language, poly-lingual document, induced text categorization model, monolingual classifier, preclassified training document, text mining, document management
Field	Data mining, Text mining, Document management system, Computer science, Exploit, Automatic learning, Natural language processing, Artificial intelligence, Text categorization, Classifier (linguistics), The Internet
Citations	1
PageRank	0.35
References	34
Authors	5

Authors (5 rows)

Name	Order	Citations	PageRank
Chih-ping Wei	1	743	74.20
Chin-Sheng Yang	2	94	8.35
Ching-Hsien Lee	3	1	0.69
Huihua Shi	4	5	1.16
Christopher C. Yang	5	1590	138.09