Hybrid approach to extracting information from web-tables - Citegraph

Paper Info

Title
Hybrid approach to extracting information from web-tables

Abstract
This study concerns the extraction of information from tables in HTML documents. In our previous work, as a prerequisite for information extraction from HTML tables, we constructed algorithms for separating meaningful tables from decorative tables, because only meaningful tables can be used to extract information and a preponderance of decorative tables in the training data harms the learning result. To extract information, this study separated the head from the body in meaningful tables by extending the head extraction algorithm constructed in our previous work, using the machine learning algorithm C4.5, and set up heuristics for table-schema extraction from meaningful tables by analyzing their head(s). In addition, table information was extracted as triples by determining the relation between the data and the extracted table schema. We obtained 71.2% accuracy in extracting table schemata and information from the meaningful tables.
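
The final step the abstract describes, turning a table into triples once its head is known, might look roughly like the sketch below. This is not the authors' algorithm: it assumes the head has already been separated from the body (the paper uses C4.5 and heuristics for that), that the head is a single row, and that the first column identifies each row. The function name `table_to_triples` and the (subject, attribute, value) layout are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# A triple is (row subject, column attribute, cell value); this layout is an assumption.
Triple = Tuple[str, str, str]


def table_to_triples(head: List[str], body: List[List[str]]) -> List[Triple]:
    """Pair each body cell with its column header, keyed by the row's first cell."""
    triples: List[Triple] = []
    for row in body:
        if not row:
            continue  # ignore empty rows
        subject = row[0]
        # The first column is treated as the row's subject, so it is skipped here.
        for attribute, value in zip(head[1:], row[1:]):
            if value.strip():  # drop empty cells
                triples.append((subject, attribute, value))
    return triples


# Example input taken from the author table on this page.
head = ["Name", "Order", "Citations"]
body = [["Sungwon Jung", "1", "320"], ["Mi-Young Kang", "2", "40"]]
triples = table_to_triples(head, body)
```

Real web tables often have multi-row or row-wise heads and spanning cells, which this single-header-row sketch does not handle.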

Year	DOI	Venue
2006	10.1007/11940098_11	ICCPOL
Keywords	Field	DocType
table schema,table-schema extraction,study concern,hybrid approach,meaningful table,html document,decorative table,previous work,head extraction algorithm,table information,information extraction,text mining,machine learning	Computer science,Information extraction,Heuristics,Artificial intelligence,Natural language processing,Web tables,Information schema,HTML,Schema (psychology),Table (information),The Internet	Conference
Volume	ISSN	ISBN
4285	0302-9743	3-540-49667-X
Citations	PageRank	References
3	0.48	6
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Sungwon Jung	1	320	59.65
Mi-Young Kang	2	40	11.87
Hyuk-Chul Kwon	3	136	29.02