Title | ||
---|---|---|
Comparison of Character n-grams and Lexical Features on Author, Gender, and Language Variety Identification on the Same Spanish News Corpus. |
Abstract | ||
---|---|---|
We compare the performance of character n-gram features (n = 3–8) and lexical features (unigrams and bigrams of words), as well as their combinations, on the tasks of authorship attribution, author profiling, and discriminating between similar languages. We developed a single multi-labeled corpus for the three aforementioned tasks, composed of news articles in different varieties of Spanish. We used the same machine-learning algorithm, Liblinear SVM, in order to find out which features are more predictive and for which task. Our experiments show that higher-order character n-grams (n = 5–8) outperform lower-order character n-grams, and the combination of all word and character n-grams of different orders (n = 1–2 for words and n = 3–8 for characters) usually outperforms smaller subsets of such features. We also evaluate the performance of character n-grams, lexical features, and their combinations when reducing all named entities to a single symbol “NE” to avoid topic-dependent features. |
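
The feature sets named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace-agnostic `\w+` tokenizer, the `char:`/`word:` prefixes, and the hand-supplied entity list are all assumptions made for illustration (a real pipeline would obtain entities from a named-entity recognizer and pass the resulting counts to a linear SVM, which is omitted here).

```python
import re
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count every character n-gram of order n_min..n_max (inclusive)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count word n-grams (unigrams and bigrams by default).

    Tokenization with \\w+ is an assumption, not the paper's tokenizer.
    """
    tokens = re.findall(r"\w+", text)
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def mask_named_entities(text, entities):
    """Replace each listed entity with the single symbol "NE".

    The entity list is hypothetical input; longer entities are replaced
    first so that a shorter entity cannot split a longer one.
    """
    for entity in sorted(entities, key=len, reverse=True):
        text = text.replace(entity, "NE")
    return text


def combined_features(text, entities=()):
    """Merge character and word n-gram counts into one feature dictionary."""
    masked = mask_named_entities(text, entities)
    features = Counter()
    for gram, count in char_ngrams(masked).items():
        features["char:" + gram] = count
    for gram, count in word_ngrams(masked).items():
        features["word:" + gram] = count
    return features
```

Calling `combined_features("Madrid gana", ["Madrid"])` first rewrites the text to `NE gana` and then counts n-grams over the masked string, so no feature contains the original entity name. Restricting `char_ngrams` to `n_min=5` would give the higher-order subset that the abstract reports as the stronger one.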
Year | Venue | Field |
---|---|---|
2017 | CLEF | Feature selection, Computer science, Symbol, Profiling (computer programming), Support vector machine, Speech recognition, Gender and Language, Bigram, Natural language processing, Artificial intelligence |
DocType | Citations | PageRank |
---|---|---|
Conference | 1 | 0.36 |
References | Authors | |
---|---|---|
4 | 4 | |
Name | Order | Citations | PageRank |
---|---|---|---|
Miguel A. Sanchez-Perez | 1 | 15 | 2.81 |
Ilia Markov | 2 | 5 | 5.23 |
Helena Gómez-Adorno | 3 | 40 | 16.01 |
Grigori Sidorov | 4 | 398 | 60.51 |