Title
Cross language text categorization by acquiring multilingual domain models from comparable corpora
Abstract
In a multilingual scenario, the classical monolingual text categorization (TC) problem can be reformulated as a cross language TC task, in which we have to cope with two or more languages (e.g. English and Italian). In this setting, the system is trained on labeled examples in a source language (e.g. English) and classifies documents in a different target language (e.g. Italian). In this paper we propose a novel approach to the cross language text categorization problem, based on acquiring Multilingual Domain Models from comparable corpora in a totally unsupervised way and without using any external knowledge source (e.g. bilingual dictionaries). These Multilingual Domain Models are exploited to define a generalized similarity function (i.e. a kernel function) among documents in different languages, which is used inside a Support Vector Machine classification framework. The results show that our approach is a feasible and cheap solution that largely outperforms a baseline.
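The idea of a kernel built on a multilingual domain model can be illustrated with a minimal sketch. The abstract does not say how the domain model is acquired, so the sketch assumes a truncated SVD over a term-by-document matrix of the comparable corpus, with both languages sharing one vocabulary. The function names `acquire_domain_model` and `domain_kernel` and the toy vocabulary are hypothetical, not taken from the paper.

```python
import numpy as np

def acquire_domain_model(term_doc, k):
    """Assumed acquisition step: truncated SVD of a (n_terms, n_docs) matrix.

    Returns a (n_terms, k) matrix mapping terms of both languages
    into a shared k-dimensional "domain" space.
    """
    u, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k] * np.sqrt(s[:k])

def domain_kernel(x, y, domain_model):
    """Cosine similarity of two bag-of-words vectors after domain mapping."""
    xd = x @ domain_model
    yd = y @ domain_model
    denom = np.linalg.norm(xd) * np.linalg.norm(yd)
    return float(xd @ yd / denom) if denom > 0 else 0.0

# Toy joint vocabulary (hypothetical): English and Italian terms, plus
# words shared by both languages ("euro", "milan") that link the two.
vocab = ["bank", "banca", "euro", "goal", "gol", "milan"]

# Columns: English finance, Italian finance, English sport, Italian sport.
term_doc = np.array([
    [1, 0, 0, 0],  # bank
    [0, 1, 0, 0],  # banca
    [1, 1, 0, 0],  # euro
    [0, 0, 1, 0],  # goal
    [0, 0, 0, 1],  # gol
    [0, 0, 1, 1],  # milan
], dtype=float)

domain_model = acquire_domain_model(term_doc, k=2)

def bow(word):
    """One-word document as a bag-of-words vector over the joint vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

sim_same_domain = domain_kernel(bow("bank"), bow("banca"), domain_model)
sim_other_domain = domain_kernel(bow("bank"), bow("gol"), domain_model)
```

In this toy setting, "bank" and "banca" share no surface form, so their plain bag-of-words similarity is zero, yet they land close together in the domain space because both co-occur with "euro". A kernel matrix computed this way could be passed to an SVM implementation that accepts precomputed kernels.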
Year: 2005
Venue: ParallelText@ACL
Keywords: classical monolingual text categorization, cross language tc task, comparable corpus, different target language, cross language text categorization, source language, generalized similarity function, multilingual domain models, multilingual domain model, kernel function, external knowledge source, different language, support vector machine, domain model
Field: Computer science, Support vector machine, Speech recognition, Artificial intelligence, Language identification, Natural language processing, Text categorization, Domain model, Kernel (statistics)
DocType: Conference
Volume: W05-08
Citations: 28
PageRank: 1.35
References: 6
Authors: 2
Name                Order   Citations   PageRank
Alfio Gliozzo       1       257         24.97
Carlo Strapparava   2       2564        230.59