Paper Info

Title
Implicit Semantics Based Metadata Extraction and Matching of Scholarly Documents.

Abstract
The authors propose using formatting templates and implicit formatting semantics for automatic metadata identification and segmentation. The plain text and its corresponding formatting information, including line height, font type, and font size, are recognized in parallel to guide metadata identification. To increase extraction accuracy, the authors combine implicit formatting semantics (such as changes in formatting, formatting templates, and their implications) with explicit formatting layouts and a predefined database of frequently occurring keywords. Unlike OCR-based approaches, the authors use the open-source PDFBox package as the basic preprocessing tool to obtain the plain text and formatting values of the document contents. On top of PDFBox they built their own pipeline program, PAXAT, to implement their metadata extraction approach. They tested 10,177 papers from arXiv, ACM, ACL, and other publicly accessible and institution-subscribed sources. The overall extraction accuracies for title, authors, affiliations, and author-affiliation matching are 0.9798, 0.9425, 0.9298, and 0.9109, respectively.
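
The idea of using formatting changes to segment metadata can be illustrated with a minimal sketch. This is not PAXAT and not the authors' algorithm; it only assumes that a preprocessing step (PDFBox in the paper) has already produced text lines with a font name and font size. The `Line` record, the function names, the "largest font is the title" heuristic, and the sample values (including the placeholder affiliation) are all hypothetical.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Line:
    """One extracted text line with its formatting (hypothetical record)."""
    text: str
    font: str
    size: float


def segment_by_format(lines):
    """Group consecutive lines that share the same font name and size.

    A change in formatting starts a new block, which is the assumed
    stand-in here for an 'implicit formatting semantics' boundary.
    """
    blocks = []
    for (font, size), group in groupby(lines, key=lambda l: (l.font, l.size)):
        text = " ".join(l.text for l in group)
        blocks.append({"font": font, "size": size, "text": text})
    return blocks


def guess_title(blocks):
    """Assumed heuristic: the block with the largest font size is the title.

    On ties, max() keeps the earliest block.
    """
    return max(blocks, key=lambda b: b["size"])["text"]


# Hypothetical first-page lines; fonts and sizes are made up for illustration.
sample = [
    Line("Implicit Semantics Based Metadata Extraction", "Times-Bold", 18.0),
    Line("and Matching of Scholarly Documents", "Times-Bold", 18.0),
    Line("Congfeng Jiang, Junming Liu", "Times-Roman", 12.0),
    Line("Example University (placeholder affiliation)", "Times-Italic", 10.0),
]

blocks = segment_by_format(sample)
title = guess_title(blocks)
```

A real pipeline would need more signals than font size alone (line height, templates, keyword lists), which is what the abstract describes; the sketch only shows how a formatting change can act as a segment boundary.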

Year	DOI	Venue
2018	10.4018/JDM.2018040101	Journal of Database Management
Keywords	Field	DocType
Formatting Semantics, Information Retrieval, Metadata Extraction, PDF Document, Template	Metadata, Data mining, Information retrieval, Computer science, Semantics	Journal
Volume	Issue	ISSN
29	2	1063-8016
Citations	PageRank	References
2	0.39	31
Authors
5

Authors (5 rows)

Name	Order	Citations	PageRank
Congfeng Jiang	1	10	2.93
Junming Liu	2	2	0.39
Dongyang Ou	3	2	1.74
Wang Yumei	4	23	13.46
Lifeng Yu	5	39	9.34
