Abstract | ||
---|---|---|
Scientists have devoted considerable effort to studying the organization and evolution of science by leveraging the textual information contained in the titles and abstracts of scientific documents. However, only a few studies analyze the whole body of a document. Using the full text instead makes it possible to unveil the organization of scientific knowledge through a network of similarity between articles based on their characterizing concepts, which can be extracted, for instance, through the ScienceWISE platform. However, such a network has a remarkably high link density (36%), which hinders the association of groups of documents with a given topic, because not all the concepts are equally informative and useful for discriminating between articles. The presence of generic concepts generates a large number of spurious connections in the system. To identify and remove these concepts, we introduce a method to gauge their relevance according to an information-theoretic approach. The significance of a concept $c$ is encoded by the distance between its maximum entropy, $S_{max}$, and the observed one, $S_c$. After removing concepts within a certain distance from the maximum, we rebuild the similarity network and analyze its topic structure. The consequences of pruning are twofold: both the number of links and the noise present in the strength of similarities between articles decrease. Hence, the filtered network displays a more refined community structure, where each community contains articles related to a specific topic. Finally, the method can be applied to other kinds of documents and also works in a coarse-grained mode, allowing the study of a corpus at different scales. |
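
The filtering idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how $S_{max}$ is computed, so the sketch assumes $S_{max} = \ln N$ (a concept spread uniformly over all $N$ documents) and takes $S_c$ to be the Shannon entropy of the concept's occurrences across documents. The function names, the toy corpus, and the threshold `min_distance` are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def concept_entropy(counts):
    """Shannon entropy (in nats) of a concept's occurrences across documents."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)


def filter_concepts(doc_concepts, min_distance):
    """Return the concepts whose entropy lies at least min_distance below S_max.

    doc_concepts: non-empty list of Counter, one per document
                  (concept -> occurrences).
    ASSUMPTION: S_max = ln(N), the entropy of a concept spread uniformly
    over all N documents; the paper may define S_max differently.
    """
    s_max = math.log(len(doc_concepts))
    concepts = set().union(*doc_concepts)
    kept = set()
    for c in concepts:
        counts = [doc[c] for doc in doc_concepts]  # Counter gives 0 if absent
        if s_max - concept_entropy(counts) >= min_distance:
            kept.add(c)
    return kept


def link_density(doc_concepts, kept):
    """Fraction of document pairs that share at least one kept concept."""
    sets = [set(doc) & kept for doc in doc_concepts]
    pairs = list(combinations(sets, 2))
    return sum(1 for a, b in pairs if a & b) / len(pairs)


# Toy corpus (hypothetical): "energy" appears everywhere, so it is generic.
docs = [
    Counter({"energy": 2, "graphene": 5}),
    Counter({"energy": 3, "graphene": 1}),
    Counter({"energy": 2, "entropy": 4}),
    Counter({"energy": 3, "entropy": 2}),
]
all_concepts = set().union(*docs)
kept = filter_concepts(docs, min_distance=0.1)

density_before = link_density(docs, all_concepts)
density_after = link_density(docs, kept)
```

In this toy corpus the near-uniform concept `energy` sits close to the assumed maximum and is pruned, so only the document pairs sharing a specific concept remain linked and the link density drops.
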
Year | Venue | Field |
---|---|---|
2017 | arXiv: Physics and Society | Topic structure,Data mining,Community structure,Information retrieval,Textual information,Sociology of scientific knowledge,Computer science,Artificial intelligence,Principle of maximum entropy,Spurious relationship,Machine learning |
DocType | Volume | Citations |
Journal | abs/1705.06510 | 0 |
PageRank | References | Authors |
0.34 | 0 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Andrea Martini | 1 | 0 | 1.35 |
Alessio Cardillo | 2 | 111 | 7.63 |
Paolo De Los Rios | 3 | 13 | 3.11 |