Title
Converting heterogeneous statistical tables on the web to searchable databases.
Abstract
Much of the world's quantitative data reside in scattered web tables. For a meaningful role in Big Data analytics, the facts reported in these tables must be brought into a uniform framework. Based on a formalization of header-indexed tables, we proffer an algorithmic solution to end-to-end table processing for a large class of human-readable tables. The proposed algorithms transform header-indexed tables to a category table format that maps easily to a variety of industry-standard data stores for query processing. The algorithms segment table regions based on the unique indexing of the data region by header paths, classify table cells, and factor header category structures of two-dimensional as well as the less common multidimensional tables. Experimental evaluations substantiate the algorithmic approach to processing heterogeneous tables. As demonstrable results, the algorithms generate queryable relational database tables and semantic-web triple stores. Application of our algorithms to 400 web tables randomly selected from diverse sources shows that the algorithmic solution automates end-to-end table processing.
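The target representation named in the abstract, a category table, can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the segmentation and header factoring steps have already produced unique row and column header paths, and it only shows the final flattening into one record per data cell. The function name `to_category_table` and the toy table are hypothetical.

```python
from itertools import product


def to_category_table(row_paths, col_paths, data):
    """Flatten a header-indexed grid into one record per data cell.

    row_paths / col_paths are lists of tuples (one header path per row or
    column); data is a list of rows. All names here are illustrative only.
    """
    # Each data cell must be uniquely indexed by one row path and one column path.
    if len(set(row_paths)) != len(row_paths) or len(set(col_paths)) != len(col_paths):
        raise ValueError("header paths must be unique")
    if len(data) != len(row_paths) or any(len(r) != len(col_paths) for r in data):
        raise ValueError("data region does not match the header paths")

    records = []
    for (i, rp), (j, cp) in product(enumerate(row_paths), enumerate(col_paths)):
        # One record: row categories, then column categories, then the value.
        records.append((*rp, *cp, data[i][j]))
    return records


# Hypothetical toy table: countries down the side, year/sex across the top.
row_paths = [("Norway",), ("Sweden",)]
col_paths = [("2014", "Male"), ("2014", "Female"),
             ("2015", "Male"), ("2015", "Female")]
data = [[10, 11, 12, 13],
        [20, 21, 22, 23]]

records = to_category_table(row_paths, col_paths, data)
```

Each resulting tuple has one field per header category plus the cell value, so it can be loaded as a row of a relational table or split into subject-predicate-object triples.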
Year: 2016
DOI: 10.1007/s10032-016-0259-1
Venue: IJDAR
Keywords: Document analysis, Table segmentation, Table analysis, Table header factoring, End-to-end table processing, Table headers, Queries over table data
Field: Row, Data mining, Decision table, Relational database, Computer science, Search engine indexing, Foreign key, Header, Big data, Table (information), Database
DocType: Journal
Volume: 19
Issue: 2
ISSN: 1433-2825
Citations: 4
PageRank: 0.43
References: 46
Authors: 4
Name, Order, Citations, PageRank
David W. Embley, 1, 1915, 480.08
Mukkai Krishnamoorthy, 2, 756, 106.02
George Nagy, 3, 913, 105.94
Sharad C. Seth, 4, 671, 93.61