Title
Entropic selection of concepts in networks of similarity between documents.
Abstract
Scientists have devoted considerable effort to studying the organization and evolution of science by leveraging the textual information contained in the titles and abstracts of scientific documents. However, only a few studies focus on the analysis of the whole body of a document. Using the full text of documents makes it possible, instead, to unveil the organization of scientific knowledge through a network of similarity between articles based on their characterizing concepts, which can be extracted, for instance, through the ScienceWISE platform. However, such a network has a remarkably high link density (36%), which hinders the association of groups of documents with a given topic, because not all the concepts are equally informative and useful for discriminating between articles. The presence of generic concepts generates a large number of spurious connections in the system. To identify and remove these concepts, we introduce a method to gauge their relevance according to an information-theoretic approach. The significance of a concept $c$ is encoded by the distance between its maximum entropy, $S_{max}$, and the observed one, $S_c$. After removing the concepts within a certain distance from the maximum, we rebuild the similarity network and analyze its topic structure. Pruning has two consequences: the number of links decreases, and so does the noise in the strength of similarities between articles. Hence, the filtered network displays a more refined community structure, where each community contains articles related to a specific topic. Finally, the method can be applied to other kinds of documents and also works in a coarse-grained mode, allowing the study of a corpus at different scales.
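The filtering idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how $S_c$ and $S_{max}$ are computed, so everything concrete here is an assumption made for illustration: $S_c$ is taken as the Shannon entropy of a concept's occurrences spread across documents, $S_{max}$ as the entropy of a uniform spread over all documents, and the similarity as a cosine over concept counts. The function names, the `threshold` parameter, and the toy corpus are hypothetical and are not taken from the paper or from ScienceWISE.

```python
import math
from itertools import combinations


def concept_entropy(counts):
    """Shannon entropy (in nats) of a concept's occurrences across documents."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def select_concepts(doc_concepts, threshold):
    """Keep concepts whose entropy lies more than `threshold` below the maximum."""
    # Assumed reference: a concept spread uniformly over every document.
    s_max = math.log(len(doc_concepts))
    concepts = {c for counts in doc_concepts.values() for c in counts}
    kept = set()
    for c in concepts:
        counts = [doc.get(c, 0) for doc in doc_concepts.values()]
        if s_max - concept_entropy(counts) > threshold:
            kept.add(c)
    return kept


def similarity_network(doc_concepts, kept):
    """Weighted edges: cosine similarity restricted to the kept concepts."""
    edges = {}
    for a, b in combinations(sorted(doc_concepts), 2):
        va = {c: n for c, n in doc_concepts[a].items() if c in kept}
        vb = {c: n for c, n in doc_concepts[b].items() if c in kept}
        dot = sum(va[c] * vb[c] for c in va.keys() & vb.keys())
        if dot > 0:
            norm_a = math.sqrt(sum(n * n for n in va.values()))
            norm_b = math.sqrt(sum(n * n for n in vb.values()))
            edges[(a, b)] = dot / (norm_a * norm_b)
    return edges


# Toy corpus (hypothetical): "model" is generic, the other concepts are topical.
docs = {
    "d1": {"model": 2, "graphene": 3},
    "d2": {"model": 2, "graphene": 1},
    "d3": {"model": 2, "galaxy": 4},
    "d4": {"model": 2, "galaxy": 2},
}
all_concepts = {"model", "graphene", "galaxy"}
kept = select_concepts(docs, threshold=0.1)
full_network = similarity_network(docs, all_concepts)
filtered_network = similarity_network(docs, kept)
```

In this toy corpus the generic concept appears evenly in every document, so its entropy sits at the assumed maximum and it is discarded. The unfiltered network links every pair of documents, while the filtered one keeps only the links within each topical pair.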
Year
2017
Venue
arXiv: Physics and Society
Field
Topic structure, Data mining, Community structure, Information retrieval, Textual information, Sociology of scientific knowledge, Computer science, Artificial intelligence, Principle of maximum entropy, Spurious relationship, Machine learning
DocType
Volume
Citations 
Journal
abs/1705.06510
0
PageRank
0.34
References
0
Authors
3
Name
Order
Citations
PageRank
Andrea Martini101.35
Alessio Cardillo21117.63
Paolo De Los Rios3133.11