Title
Document embeddings learned on various types of n-grams for cross-topic authorship attribution.
Abstract
Document embedding methods have recently been proposed to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose learning document vectors from n-grams and not only from words, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of cross-topic authorship attribution and conduct experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram-based linear models, which are among the most effective approaches for identifying an author's writing style.
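The three n-gram types named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the "_" separator, the sample sentence, and the hand-assigned POS tags are all assumptions made for illustration, and the Paragraph Vector training step is not shown.

```python
from typing import Dict, List, Sequence


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_ngrams(tokens: Sequence[str], n: int, sep: str = "_") -> List[str]:
    """Return contiguous n-grams over a token sequence (words or POS tags)."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngrams_by_order(tokens: Sequence[str], max_n: int = 5) -> Dict[int, List[str]]:
    """Collect token n-grams for every n from 1 to max_n, keyed by n."""
    return {n: token_ngrams(tokens, n) for n in range(1, max_n + 1)}


# Hypothetical sample document; the POS tags are assigned by hand here,
# whereas a real setup would obtain them from a tagger.
text = "the cat sat"
words = text.split()
pos_tags = ["DT", "NN", "VBD"]

char_trigrams = char_ngrams(text, 3)
word_bigrams = token_ngrams(words, 2)
pos_bigrams = token_ngrams(pos_tags, 2)
pos_all = ngrams_by_order(pos_tags, max_n=5)
```

Each resulting n-gram sequence could then stand in for the word sequence of a document when training an embedding model. Whether the different values of n are modelled separately or combined is not stated in the abstract, so the sketch keeps them keyed by n.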
Year
2018
Venue
Computing
Field
Mathematical optimization, Feature vector, Linear model, Writing style, Attribution, Paragraph, Artificial intelligence, Natural language processing, Artificial neural network, Mathematics
DocType
Journal
Volume
100
Issue
7
Citations
1
PageRank
0.35
References
13
Authors
4
Name, Order, Citations, PageRank
Helena Gómez-Adorno, 1, 40, 16.01
Juan Pablo Posadas-Durán, 2, 13, 3.72
Grigori Sidorov, 3, 5, 4.17
David Pinto, 4, 26, 7.99