Template-independent news extraction based on visual consistency - Citegraph

Paper Info

Title
Template-independent news extraction based on visual consistency

Abstract
Wrappers are a traditional method for extracting useful information from Web pages. Most previous work relies on the similarity between HTML tag trees and induces template-dependent wrappers. When hundreds of information sources must be extracted in a specific domain such as news, generating and maintaining these wrappers is costly. In this paper, we propose a novel template-independent news extraction approach that identifies news articles based on visual consistency. We first represent a page as a visual block tree. Then, by extracting a series of visual features, we derive a composite visual feature set that is stable across the news domain. Finally, we use a machine learning approach to generate a template-independent wrapper. Experimental results indicate that our approach is effective in extracting news across websites, even from unseen websites. The F1-value is as high as about 95%.
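
The first two steps of the abstract (a visual block tree, then visual features computed per block) can be illustrated with a minimal sketch. The abstract does not list the actual features or the tree-building procedure, so everything here is an assumption: the `VisualBlock` structure, the helper names, and the particular features (relative width, relative area, horizontal centre, text length, link density, font size) are hypothetical and chosen only because they are normalised by page size rather than tied to one site's HTML template.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class VisualBlock:
    """A rendered rectangle on the page (hypothetical structure, not from the paper)."""
    x: float
    y: float
    width: float
    height: float
    text: str = ""
    link_chars: int = 0       # number of characters that sit inside hyperlinks
    font_size: float = 12.0
    children: List["VisualBlock"] = field(default_factory=list)


def leaf_blocks(root: VisualBlock) -> Iterator[VisualBlock]:
    """Yield the leaves of the block tree in depth-first order."""
    if not root.children:
        yield root
        return
    for child in root.children:
        yield from leaf_blocks(child)


def visual_features(block: VisualBlock, page: VisualBlock) -> Dict[str, float]:
    """Illustrative features, normalised by the page rectangle (assumed, not the paper's set)."""
    n_chars = len(block.text)
    return {
        "rel_width": block.width / page.width,
        "rel_area": (block.width * block.height) / (page.width * page.height),
        "rel_center_x": (block.x + block.width / 2) / page.width,
        "text_chars": float(n_chars),
        "link_density": block.link_chars / n_chars if n_chars else 0.0,
        "font_size": block.font_size,
    }


def feature_table(page: VisualBlock) -> List[Dict[str, float]]:
    """One feature row per leaf block; the root block stands in for the whole page."""
    return [visual_features(block, page) for block in leaf_blocks(page)]
```

Rows produced by `feature_table` could then be labelled (news body or not) and passed to any standard classifier, which would correspond to the machine learning step the abstract mentions; the choice of learner is left open here because the abstract does not name one.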

Year	2007
Venue	AAAI
DocType	Conference
Citations	32
References	15
Authors	3
PageRank	1.37
Field	HTML element, Information retrieval, Biconnected component, Web page, Computer science, Feature set
Keywords	information source, composite visual feature set, novel template-independent news extraction, specific domain, news domain, template-independent news extraction, induced template-dependent wrapper, visual feature, visual consistency, news article, visual block tree, web pages, machine learning

Authors (3 rows)

Name	Order	Citations	PageRank
Shuyi Zheng	1	256	11.22
Ruihua Song	2	1138	59.33
Ji-Rong Wen	3	4431	265.98
