Title
Implicit Semantics Based Metadata Extraction and Matching of Scholarly Documents.
Abstract
The authors propose using formatting templates and implicit formatting semantics for automatic metadata identification and segmentation. The plain text and its corresponding formatting information, including line height, font type, and font size, are recognized in parallel to guide metadata identification. To increase extraction accuracy, the authors use implicit formatting semantics, such as changes of formatting, formatting templates and their implications, and explicit formatting layouts, together with a predefined database of frequently occurring keywords. Unlike OCR-based approaches, the authors use the open-source PDFBox package as the basic preprocessing tool to obtain the plain text and formatting values of the document contents. On top of PDFBox they built their own pipeline program, PAXAT, to implement their approach to metadata extraction. A total of 10,177 papers from arXiv, ACM, ACL, and other publicly accessible and institution-subscribed sources were tested. The overall extraction accuracies for title, authors, affiliations, and author-affiliation matching are 0.9798, 0.9425, 0.9298, and 0.9109, respectively.
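The idea of using a change of formatting as a segmentation cue can be illustrated with a minimal sketch. This is not PAXAT and it does not call PDFBox: it assumes the text lines and their font attributes have already been extracted. The Line record, the segment_by_format and guess_title helpers, the "largest font is the title" heuristic, and the sample data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Line:
    """One extracted text line with its formatting (hypothetical record)."""
    text: str
    font: str
    size: float


def segment_by_format(lines):
    """Group consecutive lines that share the same (font, size) pair.

    A change in either attribute starts a new block, which mimics using
    a change of formatting as a boundary between metadata fields.
    """
    blocks = []
    for (font, size), group in groupby(lines, key=lambda ln: (ln.font, ln.size)):
        text = " ".join(ln.text for ln in group)
        blocks.append({"font": font, "size": size, "text": text})
    return blocks


def guess_title(blocks):
    """Assumed heuristic: the block with the largest font size is the title.

    max() returns the first maximal item, so ties go to the earliest block.
    """
    return max(blocks, key=lambda b: b["size"])["text"]


# Made-up first-page lines; names and values are placeholders.
sample = [
    Line("A Sample Paper Title", "Times-Bold", 18.0),
    Line("That Spans Two Lines", "Times-Bold", 18.0),
    Line("Alice Example, Bob Example", "Times-Roman", 12.0),
    Line("Example University", "Times-Italic", 10.0),
]

blocks = segment_by_format(sample)
title = guess_title(blocks)
```

Here the two title lines merge into one block because they share font and size, while the author and affiliation lines each start a new block. A real pipeline would need further cues, such as the templates and keyword database mentioned in the abstract, since font size alone is often not decisive.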
Year: 2018
DOI: 10.4018/JDM.2018040101
Venue: Journal of Database Management
Keywords: Formatting Semantics, Information Retrieval, Metadata Extraction, PDF Document, Template
Field: Metadata, Data mining, Information retrieval, Computer science, Semantics
DocType: Journal
Volume: 29
Issue: 2
ISSN: 1063-8016
Citations: 2
PageRank: 0.39
References: 31
Authors: 5
Name            Order  Citations  PageRank
Congfeng Jiang  1      10         2.93
Junming Liu     2      2          0.39
Dongyang Ou     3      2          1.74
Wang Yumei      4      23         13.46
Lifeng Yu       5      39         9.34