Abstract | ||
---|---|---|
The authors propose using formatting templates and implicit formatting semantics for automatic metadata identification and segmentation. The plain text and its corresponding formatting information, including line height, font type, and font size, are recognized in parallel to guide metadata identification. The authors use implicit formatting semantics, such as changes of formatting, formatting templates and their implications, and explicit formatting layouts, together with a predefined database of frequently occurring keywords, to increase extraction accuracy. Unlike other OCR-based approaches, the authors use the open-source PDFBox package as the basic preprocessing tool to obtain the plain text and formatting values of the document contents. On top of PDFBox they built their own pipeline program, PAXAT, to implement their metadata extraction approach. A total of 10,177 papers from arXiv, ACM, ACL, and other publicly accessible and institution-subscribed sources were tested. The overall extraction accuracies for title, authors, affiliations, and author-affiliation matching are 0.9798, 0.9425, 0.9298, and 0.9109, respectively. |
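
The abstract's idea of using changes in formatting to segment metadata can be illustrated with a minimal sketch. This is not PAXAT and does not reproduce the authors' rules; the `Line` type, the function names, and the "largest font is the title" heuristic are all hypothetical, and the sample lines stand in for text and font values that a preprocessing tool such as PDFBox would supply.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Line:
    """One text line with its formatting values (hypothetical structure)."""
    text: str
    font_name: str
    font_size: float


def segment_by_format_change(lines):
    """Merge consecutive lines that share the same font name and size.

    A change in either value starts a new block, which is one simple way
    to read "change of formatting" as a segmentation signal.
    """
    blocks = []
    for (font_name, font_size), group in groupby(
        lines, key=lambda ln: (ln.font_name, ln.font_size)
    ):
        text = " ".join(ln.text for ln in group)
        blocks.append(Line(text, font_name, font_size))
    return blocks


def guess_title(blocks):
    """Assumed heuristic: the block set in the largest font is the title."""
    return max(blocks, key=lambda b: b.font_size).text


# Made-up first-page lines, for illustration only.
lines = [
    Line("Metadata Extraction from", "Times-Bold", 18.0),
    Line("PDF Documents", "Times-Bold", 18.0),
    Line("A. Author, B. Author", "Times-Roman", 12.0),
    Line("Example University", "Times-Italic", 10.0),
]

blocks = segment_by_format_change(lines)
print(guess_title(blocks))
```

A real pipeline would likely combine several such signals (line height, templates, keyword lookups) rather than rely on font size alone; the sketch only shows how parallel text and formatting values make a format change detectable.
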
Year | DOI | Venue |
---|---|---|
2018 | 10.4018/JDM.2018040101 | JOURNAL OF DATABASE MANAGEMENT |
Keywords | Field | DocType |
Formatting Semantics,Information Retrieval,Metadata Extraction,PDF Document,Template | Metadata,Data mining,Information retrieval,Computer science,Semantics | Journal |
Volume | Issue | ISSN |
29 | 2 | 1063-8016 |
Citations | PageRank | References |
2 | 0.39 | 31 |
Authors | ||
5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Congfeng Jiang | 1 | 10 | 2.93 |
Junming Liu | 2 | 2 | 0.39 |
Dongyang Ou | 3 | 2 | 1.74 |
Wang Yumei | 4 | 23 | 13.46 |
Lifeng Yu | 5 | 39 | 9.34 |