Title | ||
---|---|---|
Document embeddings learned on various types of n-grams for cross-topic authorship attribution. |
Abstract | ||
---|---|---|
Document embedding methods have recently been proposed to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose learning document vectors from n-grams rather than from words alone, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of cross-topic authorship attribution and run experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author. |
Year | Venue | Field |
---|---|---|
2018 | Computing | Mathematical optimization, Feature vector, Linear model, Writing style, Attribution, Paragraph, Artificial intelligence, Natural language processing, Artificial neural network, Mathematics |
DocType | Volume | Issue |
---|---|---|
Journal | 100 | 7 |
Citations | PageRank | References |
---|---|---|
1 | 0.35 | 13 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Helena Gómez-Adorno | 1 | 40 | 16.01 |
Juan Pablo Posadas-Durán | 2 | 13 | 3.72 |
Grigori Sidorov | 3 | 5 | 4.17 |
David Pinto | 4 | 26 | 7.99 |
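
The abstract describes three n-gram views of a document (characters, words, and POS tags), each with n from 1 to 5. The following is a minimal sketch of how such n-gram sequences could be extracted, not the authors' implementation. The function names `ngrams` and `all_ngrams`, the joiner strings, and the hand-written POS tags are assumptions made for illustration; a real pipeline would obtain the tags from a POS tagger and pass the resulting sequences to a Paragraph Vector trainer, which is not shown here.

```python
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int, joiner: str) -> List[str]:
    """Return all contiguous n-grams of `tokens`, each joined with `joiner`."""
    return [joiner.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def all_ngrams(tokens: Sequence[str], joiner: str, max_n: int = 5) -> List[str]:
    """Collect n-grams for every n from 1 to `max_n` (inclusive)."""
    out: List[str] = []
    for n in range(1, max_n + 1):
        out.extend(ngrams(tokens, n, joiner))
    return out


text = "the cat sat"

# Character view: every character (including spaces) is a token.
char_grams = all_ngrams(list(text), joiner="")

# Word view: whitespace tokenization, an assumption for this sketch.
word_grams = all_ngrams(text.split(), joiner=" ")

# POS view: tags written by hand here; a tagger would supply them in practice.
pos_tags = ["DET", "NOUN", "VERB"]
pos_grams = all_ngrams(pos_tags, joiner="_")

print(word_grams)
print(pos_grams)
```

Each view yields its own sequence of units for the same document, so one document vector can be learned per n-gram type. When a document has fewer tokens than n, that value of n simply contributes no n-grams.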