Hierarchical Clustering Analysis: The Best-Performing Approach at PAN 2017 Author Clustering Task. - Citegraph

Paper Info

Title
Hierarchical Clustering Analysis: The Best-Performing Approach at PAN 2017 Author Clustering Task.

Abstract
The author clustering problem consists of grouping documents written by the same author so that each group corresponds to a different author. We describe our approach to the author clustering task at PAN 2017, which was the best-performing system at that task. Our method performs a hierarchical clustering analysis using document features such as typed and untyped character n-grams, word n-grams, and stylometric features. We experimented with two feature representation methods, the log-entropy model and TF-IDF, while tuning minimum frequency threshold values to reduce the feature dimensionality. We identified the optimal number of clusters (authors) dynamically for each collection using the Calinski-Harabasz score. The implementation of our system is available as open source (https://github.com/helenpy/clusterPAN2017).
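
The pipeline described in the abstract (feature extraction, hierarchical clustering, and choosing the number of clusters by the Calinski-Harabasz score) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: it uses only untyped character trigrams with TF-IDF weighting, and the smoothed IDF formula, Euclidean distance, average linkage, and all function names (`char_ngrams`, `tfidf_vectors`, `agglomerate`, `calinski_harabasz`, `cluster_authors`) are assumptions made for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Untyped character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=3, min_df=1):
    """L2-normalised TF-IDF vectors; n-grams found in fewer than
    min_df documents are dropped to reduce dimensionality."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)  # document frequency
    vocab = sorted(g for g, f in df.items() if f >= min_df)
    # Smoothed IDF (an assumption; other variants exist).
    idf = {g: math.log((1 + len(docs)) / (1 + df[g])) + 1.0 for g in vocab}
    vectors = []
    for c in counts:
        v = [c[g] * idf[g] for g in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors, k):
    """Average-linkage agglomerative clustering down to k clusters.
    Returns one integer label per input vector."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(distance(vectors[p], vectors[q])
                            for p in clusters[i] for q in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for p in members:
            labels[p] = label
    return labels

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def calinski_harabasz(vectors, labels):
    """Ratio of between-cluster to within-cluster dispersion, each
    divided by its degrees of freedom. Needs 2 <= k <= n - 1."""
    n, k = len(vectors), len(set(labels))
    overall = centroid(vectors)
    between = within = 0.0
    for label in set(labels):
        members = [v for v, l in zip(vectors, labels) if l == label]
        c = centroid(members)
        between += len(members) * sum((a - b) ** 2 for a, b in zip(c, overall))
        within += sum(sum((a - b) ** 2 for a, b in zip(v, c)) for v in members)
    if within == 0.0:  # perfectly tight clusters
        return math.inf
    return (between / (k - 1)) / (within / (n - k))

def cluster_authors(docs, n=3, min_df=1):
    """Try every k from 2 to len(docs) - 1 and keep the labelling with
    the highest score. Returns None for fewer than three documents."""
    vectors = tfidf_vectors(docs, n, min_df)
    best_labels, best_score = None, -math.inf
    for k in range(2, len(docs)):
        labels = agglomerate(vectors, k)
        score = calinski_harabasz(vectors, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```

Calling `cluster_authors(list_of_texts)` returns one cluster label per document. Raising `min_df` plays the role of the minimum frequency threshold mentioned in the abstract: rarer n-grams are discarded before clustering.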

Year	2017
DOI	10.1007/978-3-319-98932-7_20
Venue	Lecture Notes in Computer Science
Keywords	Author clustering, Hierarchical clustering, Authorship-link ranking
Field	Hierarchical clustering, Computer science, Curse of dimensionality, Natural language processing, Artificial intelligence, Cluster analysis
DocType	Conference
Volume	11018
ISSN	0302-9743
Citations	2
PageRank	0.40
References	5
Authors	6

Authors (6 rows)

Name	Order	Citations	PageRank
Helena Gómez-Adorno	1	40	16.01
Carolina Martín-Del-Campo-Rodríguez	2	2	0.40
Grigori Sidorov	3	398	60.51
Yuridiana Alemán	4	5	5.30
Darnes Vilariño	5	43	19.68
David Pinto	6	280	35.77
