A statistical approach to URL-based web page clustering - Citegraph

Paper Info

Title
A statistical approach to URL-based web page clustering

Abstract
Most web page classifiers use features from the page content, which means that a page has to be downloaded before it can be classified. We propose a technique to cluster web pages using their URLs exclusively. In contrast to other proposals, we analyze features that are outside the page; hence, we do not need to download a page to classify it. The technique is also unsupervised, requiring little intervention from the user. Furthermore, we do not need to crawl a site extensively to build a classifier for that site; a small subset of its pages suffices. We have performed an experiment over 21 highly visited websites to evaluate the performance of our classifier, obtaining good precision and recall results.

Year	DOI	Venue
2012	10.1145/2187980.2188109	WWW (Companion Volume)
Keywords	Field	DocType
statistical approach,web page classifier,cluster web page,page content,small subset,url-based web page clustering,good precision,web pages	Static web page,Same-origin policy,World Wide Web,HITS algorithm,URL redirection,Information retrieval,Web page,Computer science,URL normalization,Backlink,Page view	Conference
Citations	PageRank	References
4	0.42	7
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Inma Hernández	1	76	10.72
Carlos R. Rivero	2	111	16.25
David Ruiz	3	152	20.62
Rafael Corchuelo	4	389	49.87
