Using XPath to Discover Informative Content Blocks of Web Pages - Citegraph

Paper Info

Title
Using XPath to Discover Informative Content Blocks of Web Pages

Abstract
Web pages usually contain a variety of content, only some of which is relevant to the main topic. We define relevant content as informative content blocks and irrelevant content as clutter. Clutter is intended to mislead search engines or to trigger an artificially high link-based ranking for specific target pages, so cleaning Web pages before mining is critical for improving the performance of traditional information retrieval. Here, we propose a method to discover informative content blocks without supervision. First, using a set of sample pages, we apply a series of rules to distinguish informative content blocks from clutter. Then we generalize a common XPath for the informative content blocks or the clutter and apply it to similar pages. We have applied our method to five different Web sites, producing simpler and more centralized HTML files. Experimental results show that our method can accurately obtain the informative content blocks of a Web page. Another advantage of our approach is that it is completely automatic.
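
The two-step idea in the abstract (label blocks on sample pages, then reuse an XPath on similar pages) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list their rules, so the rule used here (a block whose text is identical across sample pages is treated as clutter) is an assumption, as are the function names `block_paths`, `informative_xpaths`, and `extract` and the toy pages. It uses only the limited XPath subset of Python's standard `xml.etree.ElementTree` and expects well-formed markup.

```python
import xml.etree.ElementTree as ET


def block_paths(root):
    """Map a positional path (e.g. './body[1]/div[2]') to the text of each leaf element."""
    paths = {}

    def walk(node, path):
        children = list(node)
        if not children:
            paths[path] = "".join(node.itertext()).strip()
            return
        seen = {}  # per-tag sibling counter, matching ElementTree's tag[n] predicate
        for child in children:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, ".")
    return paths


def informative_xpaths(sample_pages):
    """Assumed rule: a path shared by all samples is informative if its text varies."""
    tables = [block_paths(ET.fromstring(page)) for page in sample_pages]
    common = set(tables[0]).intersection(*tables[1:])
    return sorted(p for p in common if len({t[p] for t in tables}) > 1)


def extract(page, xpaths):
    """Apply the XPaths learned from the samples to a structurally similar page."""
    root = ET.fromstring(page)
    blocks = []
    for xp in xpaths:
        node = root.find(xp)
        if node is not None:
            blocks.append("".join(node.itertext()).strip())
    return blocks


# Hypothetical pages from one site: shared navigation and footer, varying story text.
TEMPLATE = (
    "<html><body><div>Home | About</div>"
    "<div><p>{story}</p></div><div>Copyright</div></body></html>"
)
PAGE_A = TEMPLATE.format(story="Story one.")
PAGE_B = TEMPLATE.format(story="Story two.")
PAGE_C = TEMPLATE.format(story="Story three.")

xpaths = informative_xpaths([PAGE_A, PAGE_B])
print(xpaths)
print(extract(PAGE_C, xpaths))
```

Real Web pages are rarely well-formed XML, so a practical version would first run the HTML through a tolerant parser; that step is left out to keep the sketch self-contained.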

Year	DOI	Venue
2007	10.1109/SKG.2007.106	SKG
Keywords	Field	DocType
informative content blocks,web page,xml,high link-based ranking,informative con,various content,clutters,different web site,centralized html file,ditional information retrieval,xpath,web sites,web pages,discover informative content blocks,web site,html file,informative content block,vant content,search engine,information retrieval,information content	Static web page,Data mining,Site map,Search engine,Web mining,Web page,Information retrieval,XML,Ranking,Computer science,XPath	Conference
ISBN	Citations	PageRank
978-0-7695-3007-9	2	0.43
References	Authors
11	5

Authors (5 rows)

Name	Order	Citations	PageRank
Yan Fu	1	3	4.17
YANG Dong-Qing	2	975	201.51
Shiwei Tang	3	478	51.52
WANG Teng-Jiao	4	352	48.09
Jun Gao	5	245	25.52
