Abstract | ||
---|---|---|
HTML tables and spreadsheets on the Internet or in enterprise intranets usually contain valuable information, but are created ad hoc. As a result, they often lack both systematic names for column headers and clear vocabulary for cell values.
This limits the re-use of such tables and creates a huge heterogeneity problem when comparing or aggregating multiple tables.
This paper aims to overcome this problem by automatically canonicalizing header names and cell values onto concepts, classes, entities and uniquely represented quantities registered in a knowledge base.
To this end, we devise a probabilistic graphical model that captures coherence dependencies between cells in tables and candidate items in the space of concepts, entities and quantities.
We give specific consideration to quantities, which are mapped into (measure, dimension, unit, magnitude) quadruples over a taxonomy of physical (e.g. power consumption), monetary (e.g. revenue), temporal (e.g. date) and dimensionless (i.e. counts) measures.
Our experiments with Web tables from diverse domains demonstrate the viability of our method and its benefits over baselines. |
Year | DOI | Venue |
---|---|---|
2016 | 10.1145/2983323.2983772 | Proceedings of the 25th ACM International Conference on Information and Knowledge Management |
Keywords | Field | DocType |
Information extraction,Data understanding,Data integration and aggregation | Data mining,Information retrieval,Computer science,Information extraction,Knowledge base,Probabilistic logic,Graphical model,Header,Vocabulary,Table (information),The Internet | Conference |
Citations | PageRank | References |
9 | 0.48 | 27 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Yusra Ibrahim | 1 | 22 | 3.10 |
Mirek Riedewald | 2 | 1136 | 84.31 |
Gerhard Weikum | 3 | 12710 | 2146.01 |