Title
Extracting structured data from unstructured document with incomplete resources
Abstract
We present a method for extracting structured elements of information, called structured data (sdata), from OCR'ed pages. The method first analyzes the layout of the page, building several concurrent layout structures. A tagging step then tags textual elements based on their content. By combining the layout structures and the tagged elements, layout models representing the structured data are inferred for the current page. These models are used to correct elements that the tagging step mis-tagged and to tag elements that it missed. Finally, the set of structured data is extracted. An evaluation is presented.
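The pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Element` class, the single `AMOUNT` regex, the column grouping by x coordinate, and the majority-vote rule are all assumptions made for illustration. It only shows the general idea of letting a layout structure recover elements that a content-based tagger missed.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """A hypothetical OCR'ed text element with its left x coordinate."""
    text: str
    x: int
    tag: Optional[str] = None


# Assumed content-based tagger: one regex for amounts such as 12.50 or 7,00.
AMOUNT = re.compile(r"^\d+[.,]\d{2}$")


def tag_by_content(elements: List[Element]) -> None:
    """Tagging step: tag elements whose text matches the pattern."""
    for e in elements:
        if AMOUNT.match(e.text):
            e.tag = "amount"


def group_into_columns(elements: List[Element], tolerance: int = 5) -> List[List[Element]]:
    """One possible layout structure: elements with nearby x coordinates."""
    columns: List[List[Element]] = []
    for e in sorted(elements, key=lambda el: el.x):
        if columns and abs(columns[-1][0].x - e.x) <= tolerance:
            columns[-1].append(e)
        else:
            columns.append([e])
    return columns


def apply_layout_models(columns: List[List[Element]], min_support: float = 0.5) -> None:
    """Assumed model: if most of a column carries one tag, the whole column does."""
    for col in columns:
        counts = Counter(e.tag for e in col if e.tag)
        if not counts:
            continue  # no tagged element, so no model is inferred for this column
        tag, n = counts.most_common(1)[0]
        if n / len(col) > min_support:
            for e in col:
                e.tag = tag  # fills in elements the tagger missed


# Toy page: a column of labels and a column of amounts, one with an OCR error.
page = [
    Element("Widget", 100), Element("12.50", 300),
    Element("Gadget", 101), Element("7,00", 302),
    Element("Gizmo", 100), Element("l9.99", 301),  # "l" misread for "1"
]

tag_by_content(page)
missed = [e.text for e in page if e.x >= 300 and e.tag is None]
apply_layout_models(group_into_columns(page))
amounts = [e.text for e in page if e.tag == "amount"]
print(missed, amounts)
```

In this sketch the regex alone misses the misrecognized value, while the column-level vote recovers it. A real system would need several concurrent layout structures and a richer tagger, as the abstract indicates.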
Year
2015
DOI
10.1109/ICDAR.2015.7333766
Venue
ICDAR '15 Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR)
Keywords
document layout analysis
Field
Information retrieval, Computer science, Document layout analysis, Data extraction, Data model
DocType
Conference
ISSN
1520-5363
Citations
5
PageRank
0.55
References
3
Authors
1
Name | Order | Citations | PageRank
Hervé Déjean | 1 | 377 | 48.52