Paper Info

Title
Language Independent n-Gram-Based Text Categorization with Weighting Factors: A Case Study.

Abstract
We introduce a new language independent text categorization technique based on n-grams profile representation of restricted size of both document and a category, an n-gram weighting factors scheme, and a simple algorithm for comparing profiles. The technique does not require any morphological analysis of texts, any preprocessing steps, or any prior information about document content or language. We apply it to the text categorization problem in two widely spoken yet paradigmatically quite different languages – English and Arabic, thus demonstrating language-independence. We used their publicly available document collections – 20-Newsgroups and Mesleh-10, respectively. Experimental results presented in terms of macro- and micro-averaged F1 measures imply that the new technique outperforms other n-gram based and bag-of-words machine learning techniques when applied to English and Arabic text categorization.
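As a rough illustration of the kind of pipeline the abstract describes (size-restricted n-gram profiles for documents and categories, plus a simple profile comparison), here is a minimal Python sketch. The abstract does not give the paper's weighting factors or its comparison measure, so this sketch is not the authors' method: it ranks n-grams by raw frequency and uses a rank-based "out-of-place" distance as a stand-in. The function names `ngram_profile`, `out_of_place`, and `categorize`, and the defaults `n=3` and `size=300`, are assumptions made for illustration only.

```python
from collections import Counter


def ngram_profile(text, n=3, size=300):
    """Return the `size` most frequent character n-grams, most frequent first.

    No tokenization, stemming, or language-specific preprocessing is applied.
    Ranking by raw frequency is an assumption; the paper's weighting factors
    are not specified in the abstract.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def out_of_place(doc_profile, cat_profile):
    """Sum of rank differences between two profiles (stand-in measure).

    An n-gram absent from the category profile gets the maximum penalty,
    taken here to be the length of the category profile.
    """
    cat_rank = {gram: rank for rank, gram in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    return sum(
        abs(rank - cat_rank[gram]) if gram in cat_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def categorize(text, category_profiles, n=3, size=300):
    """Return the category whose profile is closest to the document's profile."""
    doc_profile = ngram_profile(text, n, size)
    return min(
        category_profiles,
        key=lambda cat: out_of_place(doc_profile, category_profiles[cat]),
    )
```

In this sketch a category profile would be built by calling `ngram_profile` on the concatenated training documents of that category, and `categorize` would then be called on each test document with a dictionary mapping category names to those profiles.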

Year: 2015
Venue: JIDM
Field: Categorization, Weighting, Computer science, Preprocessor, Natural language processing, Language identification, Artificial intelligence, n-gram, Constructed language, Text categorization, Macro
DocType: Journal
Volume: 6
Issue: 1
Citations: 0
PageRank: 0.34
References: 14
Authors: 3

Authors (3 rows)

Name	Order	Citations	PageRank
Jelena Graovac	1	4	1.80
Jovana J. Kovacevic	2	12	2.15
Gordana Pavlovic-Lazetic	3	35	7.82
