TEXT: Automatic Template Extraction from Heterogeneous Web Pages - Citegraph

Paper Info

Title
TEXT: Automatic Template Extraction from Heterogeneous Web Pages

Abstract
World Wide Web is the most useful source of information. In order to achieve high productivity of publishing, the webpages in many websites are automatically populated by using the common templates with contents. The templates provide readers easy access to the contents guided by consistent structures. However, for machines, the templates are considered harmful since they degrade the accuracy and performance of web applications due to irrelevant terms in templates. Thus, template detection techniques have received a lot of attention recently to improve the performance of search engines, clustering, and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents which are generated from heterogeneous templates. We cluster the web documents based on the similarity of underlying template structures in the documents so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure with its fast approximation for clustering and provide comprehensive analysis of our algorithm. Our experimental results with real-life data sets confirm the effectiveness and robustness of our algorithm compared to the state of the art for template detection algorithms.
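As a rough illustration of clustering by "similarity of underlying template structures", the sketch below compares pages by the sets of tag paths they contain and estimates that similarity with MinHash, which appears among the keywords. This is not the paper's algorithm or its goodness measure. The path-set representation, the helper names (`tag_paths`, `jaccard`, `minhash_signature`, `estimate_jaccard`), and the choice of 64 hash functions are all assumptions made for this example.

```python
import hashlib
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closes.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Assumed structural representation: the set of tag paths in the page."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=64):
    """MinHash signature: the minimum of each seeded hash over a non-empty set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"),
                                digest_size=8, salt=salt).digest(),
                "big")
            for item in items))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Two pages with the same layout and one page with a different layout.
page_a = "<html><body><div><h1>A</h1><p>x</p></div></body></html>"
page_b = "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>"
page_c = "<html><body><table><tr><td>q</td></tr></table></body></html>"

paths_a, paths_b, paths_c = (tag_paths(p) for p in (page_a, page_b, page_c))
sig_a, sig_b, sig_c = (minhash_signature(p) for p in (paths_a, paths_b, paths_c))

print(jaccard(paths_a, paths_b), estimate_jaccard(sig_a, sig_b))
print(jaccard(paths_a, paths_c), estimate_jaccard(sig_a, sig_c))
```

Here `page_a` and `page_b` differ only in content, so their path sets are identical and both the exact and the estimated similarity are 1.0. `page_c` shares only the `html` and `html/body` paths with `page_a`, giving an exact Jaccard similarity of 0.25 that the MinHash estimate approximates. A clustering step would then group pages whose similarity is high.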

Year	DOI	Venue
2011	10.1109/TKDE.2010.140	IEEE Trans. Knowl. Data Eng.
Keywords	Field	DocType
automatic template extraction,minimum description length principle,world wide web,web applications,template detection algorithm,web documents,common template,template extraction,text,feature extraction,internet,underlying template structure,novel goodness measure,template detection technique,heterogeneous web pages,web application,web document,present novel algorithm,search engines,heterogeneous template,clustering,data model,web pages,search engine,minhash,merging,indexing terms,data models,clustering algorithms,html,data mining,xml	Data modeling,Data mining,MinHash,Web page,Information retrieval,XML,Computer science,Feature extraction,Web application,Template,Cluster analysis	Journal
Volume	Issue	ISSN
23	4	1041-4347
Citations	PageRank	References
13	0.88	18
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Chulyun Kim	1	105	7.77
Kyuseok Shim	2	5120	752.19
