Abstract | ||
---|---|---|
Automating the conversion of human-readable HTML tables into machine-readable relational tables will enable end-user query processing of the millions of data tables found on the web. Theoretically sound and experimentally successful methods for index-based segmentation, extraction of category hierarchies, and construction of a canonical table suitable for direct input to a relational database are demonstrated on 200 heterogeneous web tables. The methods are scalable: the program generates the 198 Access-compatible CSV files in ~0.1 s per table (two tables could not be indexed). |
Year | DOI | Venue |
---|---|---|
2014 | 10.1109/DAS.2014.9 | Document Analysis Systems |
Keywords | Field | DocType |
header cross-product,wang category,header factoring,table segmentation,canonical relational table,table index,layout,indexing,relational databases,world wide web,relational database,internet,text analysis,html | Row,Decision table,Information retrieval,Relational database,Computer science,Segmentation,Search engine indexing,Foreign key,Table (information),Database,Scalability | Conference |
Citations | PageRank | References |
0 | 0.34 | 0 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
George Nagy | 1 | 913 | 105.94 |
Sharad C. Seth | 2 | 671 | 93.61 |
David W. Embley | 3 | 1915 | 480.08 |