Text-based measures of document diversity - Citegraph

Paper Info

Title
Text-based measures of document diversity

Abstract
Quantitative notions of diversity have been explored across a variety of disciplines ranging from conservation biology to economics. However, there has been relatively little work on measuring the diversity of text documents via their content. In this paper we present a text-based framework for quantifying how diverse a document is in terms of its content. The proposed approach learns a topic model over a corpus of documents, and computes a distance matrix between pairs of topics using measures such as topic co-occurrence. These pairwise distance measures are then combined with the distribution of topics within a document to estimate each document's diversity relative to the rest of the corpus. The method provides several advantages over existing methods. It is fully data-driven, requiring only the text from a corpus of documents as input, it produces human-readable explanations, and it can be generalized to score diversity of other entities such as authors, academic departments, or journals. We describe experimental results on several large data sets which suggest that the approach is effective and accurate in quantifying how diverse a document is relative to other documents in a corpus.
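
The pipeline described in the abstract (topic proportions per document, a pairwise topic distance matrix, and a score combining the two) can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so this example assumes a Rao-style quadratic form, diversity(d) = sum over topic pairs (i, j) of P(i|d) * P(j|d) * delta(i, j), and assumes cosine distance between topic columns as the co-occurrence measure. The function names and the toy matrix `theta` are hypothetical, and the topic proportions are taken as already produced by some topic model.

```python
import numpy as np

def topic_distance_matrix(theta):
    """Assumed co-occurrence distance: 1 - cosine similarity between topic columns.

    theta has shape (n_docs, n_topics); row d holds P(topic | document d).
    Topics that tend to appear in the same documents end up close together.
    """
    theta = np.asarray(theta, dtype=float)
    norms = np.linalg.norm(theta, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for unused topics
    unit = theta / norms
    sim = unit.T @ unit              # (n_topics, n_topics) cosine similarities
    delta = 1.0 - np.clip(sim, 0.0, 1.0)
    np.fill_diagonal(delta, 0.0)     # a topic is at distance 0 from itself
    return delta

def document_diversity(theta, delta):
    """Rao-style score per document: sum_ij P(i|d) * P(j|d) * delta[i, j]."""
    theta = np.asarray(theta, dtype=float)
    return np.einsum("di,ij,dj->d", theta, delta, theta)

# Toy corpus (hypothetical): 4 documents over 3 topics.
theta = np.array([
    [1.0, 0.0, 0.0],   # single-topic document
    [0.0, 1.0, 0.0],   # single-topic document
    [0.5, 0.5, 0.0],   # mixes topics 0 and 1
    [0.0, 0.0, 1.0],   # single-topic document
])
delta = topic_distance_matrix(theta)
scores = document_diversity(theta, delta)
print(scores)
```

Under these assumptions, single-topic documents score zero, and a document scores higher the more evenly it spreads its mass over topics that rarely co-occur elsewhere in the corpus.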

Year	2013
DOI	10.1145/2487575.2487672
Venue	Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Keywords	Diversity, Interdisciplinarity
DocType	Conference
Citations	19
PageRank	1.16
References	8
Authors	3

Authors (3 rows)

Name	Order	Citations	PageRank
Kevin Bache	1	19	1.16
David Newman	2	1319	73.72
Padhraic Smyth	3	7148	1451.38
