Title
Text-based measures of document diversity
Abstract
Quantitative notions of diversity have been explored across a variety of disciplines ranging from conservation biology to economics. However, there has been relatively little work on measuring the diversity of text documents via their content. In this paper we present a text-based framework for quantifying how diverse a document is in terms of its content. The proposed approach learns a topic model over a corpus of documents, and computes a distance matrix between pairs of topics using measures such as topic co-occurrence. These pairwise distance measures are then combined with the distribution of topics within a document to estimate each document's diversity relative to the rest of the corpus. The method provides several advantages over existing methods. It is fully data-driven, requiring only the text from a corpus of documents as input, it produces human-readable explanations, and it can be generalized to score diversity of other entities such as authors, academic departments, or journals. We describe experimental results on several large data sets which suggest that the approach is effective and accurate in quantifying how diverse a document is relative to other documents in a corpus.
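As a rough illustration of the pipeline the abstract describes, the sketch below scores documents from a topic-distance matrix and each document's topic distribution. The abstract does not state the exact formulas, so several parts are assumptions: the distance is taken to be one minus the cosine similarity of topic co-occurrence profiles, the combination step is assumed to be Rao's quadratic entropy, and the topic proportions in `doc_topic` are hand-made toy values rather than the output of a learned topic model. The function names `topic_distance_matrix` and `document_diversity` are hypothetical.

```python
from math import sqrt


def topic_distance_matrix(doc_topic):
    """Build a topic-by-topic distance matrix from document-topic proportions.

    Assumption: distance = 1 - cosine similarity between the columns of
    doc_topic, so topics that tend to appear in the same documents are close.
    """
    n_topics = len(doc_topic[0])
    # Column t holds the weight of topic t in every document.
    cols = [[row[t] for row in doc_topic] for t in range(n_topics)]
    norms = [sqrt(sum(x * x for x in col)) for col in cols]
    delta = [[0.0] * n_topics for _ in range(n_topics)]  # diagonal stays 0
    for i in range(n_topics):
        for j in range(i + 1, n_topics):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            denom = norms[i] * norms[j]
            sim = dot / denom if denom > 0 else 0.0
            # Clamp to guard against tiny negative values from rounding.
            delta[i][j] = delta[j][i] = max(0.0, 1.0 - sim)
    return delta


def document_diversity(theta, delta):
    """Rao's quadratic entropy (assumed): sum_i sum_j theta_i * theta_j * delta_ij."""
    n = len(theta)
    return sum(theta[i] * theta[j] * delta[i][j]
               for i in range(n) for j in range(n))


# Toy corpus: 4 documents over 3 topics (rows sum to 1).
doc_topic = [
    [0.5, 0.5, 0.0],  # mixes topics 0 and 1, which often co-occur
    [0.6, 0.4, 0.0],
    [0.0, 0.0, 1.0],  # entirely about topic 2
    [0.5, 0.0, 0.5],  # mixes topics 0 and 2, which rarely co-occur
]

delta = topic_distance_matrix(doc_topic)
scores = [document_diversity(theta, delta) for theta in doc_topic]
```

Under these assumptions, a document concentrated on a single topic scores zero, and a document that mixes rarely co-occurring topics scores higher than one that mixes topics that commonly appear together. The per-topic-pair terms of the sum also indicate which topic combinations drive a document's score.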
Year
2013
DOI
10.1145/2487575.2487672
Venue
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Keywords
Diversity, Interdisciplinarity
DocType
Conference
Citations
19
PageRank
1.16
References
8
Authors
3
Name              Order   Citations   PageRank
Kevin Bache       1       19          1.16
David Newman      2       1319        73.72
Padhraic Smyth    3       7148        1451.38