Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus. - Citegraph

Paper Info

Title
Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus.

Abstract
We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5–8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1–2 for words and n = 3–8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.
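
The feature sets named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the regex tokenizer, the lowercasing of word tokens, and the caller-supplied entity list are all assumptions made for illustration, and the classifier (Liblinear SVM in the paper) is left out entirely.

```python
import re
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count every character n-gram with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word unigrams and bigrams.

    Assumption: tokens are runs of word characters, lowercased.
    The abstract does not specify a tokenizer.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def mask_entities(text, entities):
    """Replace each listed entity string with the single symbol 'NE'.

    Assumption: the entity strings are already known; the abstract
    does not say how named entities were detected.
    """
    # Longest first, so "Buenos Aires" is masked before "Buenos".
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, "NE")
    return text


def combined_features(text):
    """Merge character and word n-gram counts into one feature map."""
    features = Counter()
    for gram, count in char_ngrams(text).items():
        features["c:" + gram] += count
    for gram, count in word_ngrams(text).items():
        features["w:" + gram] += count
    return features
```

The "c:" and "w:" prefixes keep a character 3-gram such as "los" distinct from the word unigram "los" when both feature families are combined. The resulting count maps could be vectorized and passed to any linear classifier.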

Year	Venue	Field
2017	CLEF	Feature selection, Computer science, Symbol, Profiling (computer programming), Support vector machine, Speech recognition, Gender and Language, Bigram, Natural language processing, Artificial intelligence
DocType	Citations	PageRank
Conference	1	0.36
References	Authors
4	4

Authors (4 rows)

Name	Order	Citations	PageRank
Miguel A. Sanchez-Perez	1	15	2.81
Ilia Markov	2	5	5.23
Helena Gómez-Adorno	3	40	16.01
Grigori Sidorov	4	398	60.51
