Title
CiteSeerX data: semanticizing scholarly papers.
Abstract
Scholarly big data is, for many, an important instance of Big Data. Digital library search engines have been built to acquire, extract, and ingest large volumes of scholarly papers. This paper provides an overview of the scholarly big data released by CiteSeerX as of the end of 2015, and discusses how the data is acquired, its size, general quality, data management, and accessibility. Preliminary results on extracting semantic entities from the body text of scholarly papers with Wikifier show a bias towards general terms appearing in Wikipedia and against domain-specific terms. We argue that the latter will play a more important role in extracting important facts from scholarly papers.
Year
2016
DOI
10.1145/2928294.2928306
Venue
SBD@SIGMOD
Field
Data mining, World Wide Web, Information retrieval, Computer science, Digital library, Data management, Big data, Body text
DocType
Conference
Citations
2
PageRank
0.37
References
20
Authors
4
Name
Order
Citations
PageRank
Jian Wu132.43
Chen Liang2637.53
Huaiyu Yang3110.99
C. Lee Giles4111541549.48