Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features. - Citegraph

Paper Info

Title
Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features.

Abstract
Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress to extract figures and their captions has been made, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, post-Optical Character Recognition (OCR), which uses both grayscale and OCR-features. When applied to the astrophysics literature holdings of the Astrophysics Data System (ADS), we find F1 scores of 90.9% (92.2%) for figures (figure captions) with the intersection-over-union (IOU) cut-off of 0.9 which is a significant improvement over other state-of-the-art methods.

Year	DOI	Venue
2022	10.1007/978-3-031-16802-4_5	International Conference on Theory and Practice of Digital Libraries (TPDL)
DocType	Citations	PageRank
Conference	0	0.34
References	Authors
0	3

Authors (3 rows)

Cited by (0 rows)

References (0 rows)

Name	Order	Citations	PageRank
J. P. Naiman	1	0	0.34
Peter K. G. Williams	2	0	0.34
Alyssa Goodman	3	0	0.34

1