Document embeddings learned on various types of n-grams for cross-topic authorship attribution. - Citegraph

Paper Info

Title
Document embeddings learned on various types of n-grams for cross-topic authorship attribution.

Abstract
Recently, document embedding methods have been proposed that aim to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose learning document vectors based on n-grams rather than only on words, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of cross-topic authorship attribution and run experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram-based linear models, which are among the most effective approaches for identifying an author's writing style.
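
The three n-gram types named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper name `extract_ngrams`, the separators, and the example POS tag sequence are assumptions made for illustration. The Paragraph Vector training step is omitted, and the POS tags are assumed to come from an external tagger.

```python
from typing import List, Sequence


def extract_ngrams(items: Sequence[str], n_min: int = 1, n_max: int = 5,
                   sep: str = " ") -> List[str]:
    """Return every contiguous n-gram of `items` for n_min <= n <= n_max.

    Hypothetical helper: each n-gram is joined into one string with `sep`.
    Sequences shorter than n simply contribute no n-grams of that size.
    """
    return [
        sep.join(items[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(items) - n + 1)
    ]


text = "the cat sat"

# Character n-grams: treat each character (including spaces) as one item.
char_grams = extract_ngrams(list(text), sep="")

# Word n-grams: whitespace tokenization is an assumption, not the paper's tokenizer.
word_grams = extract_ngrams(text.split())

# POS n-grams: the tag sequence is assumed to be produced by some external tagger.
pos_tags = ["DT", "NN", "VBD"]
pos_grams = extract_ngrams(pos_tags)
```

Each resulting list could then be treated as the "token" sequence of a document when training an embedding model, in place of plain words.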

Year	2018
Venue	Computing
Field	Mathematical optimization, Feature vector, Linear model, Writing style, Attribution, Paragraph, Artificial intelligence, Natural language processing, Artificial neural network, Mathematics
DocType	Journal
Volume	100
Issue	7
Citations	1
PageRank	0.35
References	13
Authors	4

Authors (4 rows)

Name	Order	Citations	PageRank
Helena Gómez-Adorno	1	40	16.01
Juan Pablo Posadas-Durán	2	13	3.72
Grigori Sidorov	3	5	4.17
David Pinto	4	26	7.99
