U-REST: an unsupervised record extraction system - Citegraph

Paper Info

Title
U-REST: an unsupervised record extraction system

Abstract
In this paper, we describe a system that can extract record structures from web pages with no direct human supervision. Records are commonly occurring HTML-embedded data tuples that describe people, offered courses, products, company profiles, etc. We present a simplified framework for studying the problem of unsupervised record extraction, one which separates the algorithms from the feature engineering. Our system, U-REST, formalizes an approach to the problem of unsupervised record extraction using a simple two-stage machine learning framework. The first stage involves clustering, where structurally similar regions are discovered, and the second stage involves classification, where discovered groupings (clusters of regions) are ranked by their likelihood of being records. In our work, we describe and summarize the results of an extensive survey of features for both stages. We conclude by comparing U-REST to related systems. The results of our empirical evaluation show encouraging improvements in extraction accuracy.
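
The two-stage idea in the abstract (cluster structurally similar regions, then rank the clusters by how record-like they are) can be illustrated with a minimal sketch. It assumes each page region has already been reduced to a sequence of tag names. Exact-match grouping stands in for the clustering stage, and a hand-written score stands in for the learned classifier. The function names, the example data, and the scoring rule are illustrative assumptions and are not taken from U-REST.

```python
from collections import defaultdict


def cluster_regions(regions):
    """Stage 1 (stand-in): group regions whose tag sequences are identical.

    `regions` is a list of (region_id, tag_sequence) pairs. A real system
    would use a softer notion of structural similarity than exact equality.
    """
    clusters = defaultdict(list)
    for region_id, tags in regions:
        clusters[tuple(tags)].append(region_id)
    return dict(clusters)


def score_cluster(signature, members):
    """Stage 2 (stand-in): hand-written score, not a trained classifier.

    Assumption: records repeat on a page and contain several distinct
    fields, so singleton clusters get zero and the rest are scored by
    (number of members) * (number of distinct tags).
    """
    if len(members) < 2:
        return 0.0
    return float(len(members) * len(set(signature)))


def rank_clusters(regions):
    """Return (score, signature, members) triples, highest score first."""
    clusters = cluster_regions(regions)
    scored = [(score_cluster(sig, ids), sig, ids) for sig, ids in clusters.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Hypothetical regions: three list items that look like records,
# two navigation links, and one footer paragraph.
EXAMPLE_REGIONS = [
    ("r1", ["li", "a", "span"]),
    ("r2", ["li", "a", "span"]),
    ("r3", ["li", "a", "span"]),
    ("nav1", ["a"]),
    ("nav2", ["a"]),
    ("footer", ["p", "a"]),
]

if __name__ == "__main__":
    for score, signature, members in rank_clusters(EXAMPLE_REGIONS):
        print(score, signature, members)
```

Keeping the grouping step and the scoring step in separate functions mirrors the separation of algorithms from feature engineering that the abstract emphasizes: either stand-in can be swapped out without touching the other.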

Year	DOI	Venue
2007	10.1145/1242572.1242844	WWW
Keywords	Field	DocType
extensive survey,direct human supervision,extraction accuracy,html-embedded data tuples,unsupervised record extraction system,feature engineering,company profile,unsupervised record extraction,empirical evaluation,approach to the problem,related system,clustering,web pages,structural similarity,machine learning	Data mining,World Wide Web,Information retrieval,Ranking,Web page,Computer science,Tuple,Artificial intelligence,Cluster analysis,Machine learning	Conference
Citations	PageRank	References
7	0.41	6
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Yuan Kui Shen	1	12	1.27
David R. Karger	2	19367	2233.64
