Title
Clustering Web Pages Based on Structure and Style Similarity (Application Paper).
Abstract
We consider the task of cluster analysis on web pages, using several techniques to group the pages. While some applications require grouping web pages by the semantic meaning of their content, we focus on clustering based on web page structure and style, for applications such as categorization, cleaning, schema detection, and automatic extraction. This paper describes applications of similarity measures and a clustering technique for grouping web pages into clusters. The structural similarity of HTML pages is measured using the tree edit distance on DOM trees. The stylistic similarity is measured using Jaccard similarity on CSS class names. An aggregated similarity measure is computed by combining the structural and stylistic measures. A clustering method is then applied to this aggregated similarity measure to group the documents.
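The stylistic measure and the aggregation step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names (`css_classes`, `jaccard`, `aggregate`), the equal-weight linear combination, and the convention that two pages without any classes are fully similar are all assumptions. The structural score is taken as an input here, presumed to come from a separate tree edit distance computation normalized to the range 0 to 1.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name used in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def css_classes(html):
    """Return the set of CSS class names found in an HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        # Assumption: two pages with no classes count as stylistically identical.
        return 1.0
    return len(a & b) / len(a | b)


def aggregate(structural, stylistic, weight=0.5):
    """Hypothetical aggregation: a weighted sum of the two similarity scores."""
    return weight * structural + (1.0 - weight) * stylistic


page_a = '<div class="nav main"><p class="text">A</p></div>'
page_b = '<div class="nav"><p class="text">B</p><span class="ad">x</span></div>'

style_sim = jaccard(css_classes(page_a), css_classes(page_b))

# 0.75 stands in for a structural similarity computed elsewhere.
combined = aggregate(0.75, style_sim)
```

Computing `combined` for every pair of pages yields a similarity matrix that a clustering method can then consume.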
Year
2016
Venue
IRI
Field
Data mining, Data modeling, Categorization, Information retrieval, Web page, Similarity measure, Computer science, Cascading Style Sheets, Jaccard index, Cluster analysis, Schema (psychology)
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
2
Name                Order   Citations   PageRank
Thamme Gowda        1       1           1.06
Chris A. Mattmann   2       200         25.39