Title
Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus.
Abstract
We compare the performance of character n-gram features (n = 3-8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5-8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1-2 for words and n = 3-8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features.
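The feature sets named in the abstract can be illustrated with a minimal sketch, given here only as an approximation of the idea. It counts character n-grams (n = 3-8) and word n-grams (n = 1-2) after masking named entities with the symbol NE. The whitespace tokenizer, the caller-supplied entity list, the `c:`/`w:` key prefixes, and all function names are assumptions made for illustration; the abstract does not say how entities were detected or how features were weighted.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count overlapping character n-grams for every n in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts["c:" + text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams; tokens come from a plain whitespace split (assumption)."""
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts["w:" + " ".join(tokens[i:i + n])] += 1
    return counts


def mask_entities(text, entities):
    """Replace each listed entity string with the single symbol NE.

    The entity list is hypothetical input; a real pipeline would use a
    named-entity recognizer instead.
    """
    # Mask longer entities first so a multi-word name is replaced as a whole.
    for ent in sorted(entities, key=len, reverse=True):
        text = text.replace(ent, "NE")
    return text


def combined_features(text, entities=()):
    """Merge character and word n-gram counts computed on the masked text."""
    masked = mask_entities(text, entities)
    feats = char_ngrams(masked)
    feats.update(word_ngrams(masked))  # Counter.update adds counts
    return feats


feats = combined_features("Messi juega en Barcelona",
                          entities=["Messi", "Barcelona"])
```

The resulting count dictionaries could then be vectorized and passed to a linear SVM; that step is omitted here because it depends on the chosen library.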
Year: 2017
Venue: CLEF
Field: Feature selection, Computer science, Symbol, Profiling (computer programming), Support vector machine, Speech recognition, Gender and Language, Bigram, Natural language processing, Artificial intelligence
DocType: Conference
Citations: 1
PageRank: 0.36
References: 4
Authors: 4
Name                     Order  Citations  PageRank
Miguel A. Sanchez-Perez  1      15         2.81
Ilia Markov              2      5          5.23
Helena Gómez-Adorno      3      40         16.01
Grigori Sidorov          4      398        60.51