Title
Geodesic distances for web document clustering
Abstract
While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article.
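As a rough illustration of the building blocks named in the abstract, the following minimal Python sketch computes a cosine distance between term vectors, a local clustering coefficient from a link graph, and a micro-precision score for a clustering. All function names and the toy data layout are assumptions made for this example. Micro-precision is taken here in its commonly used form (share of documents carrying the majority label of their cluster), which is assumed to match the paper's usage. The abstract does not state how clustering coefficients are turned into curvature values or how these enter the geodesic distance formula, so that step is deliberately left out.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumed convention for documents without terms
    return 1.0 - dot / (norm_u * norm_v)


def clustering_coefficient(graph, node):
    """Local clustering coefficient of `node` in an undirected graph.

    `graph` maps each node to the set of its neighbours (hypothetical
    layout; node identifiers must be mutually comparable, e.g. strings).
    """
    neighbours = graph[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count each linked pair of neighbours exactly once.
    links = sum(
        1 for a in neighbours for b in neighbours if a < b and b in graph[a]
    )
    return 2.0 * links / (k * (k - 1))


def micro_precision(cluster_ids, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority / len(labels)


# Toy link graph: a triangle a-b-c with a pendant page d attached to c.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
```

On the toy graph, page `a` sits inside a closed triangle and page `d` has a single link, so their coefficients fall at the two extremes of the range. This kind of spread is what a per-document curvature estimate would build on.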
Year
2011
DOI
10.1109/CIDM.2011.5949449
Venue
Computational Intelligence and Data Mining
Keywords
document handling, information retrieval systems, pattern clustering, English Wikipedia hyperlinked data, Web document clustering, clustering coefficient value, cosine distance, document analysis, geodesic distance, k-means algorithm, term-document matrix
Field
Hierarchical clustering, k-medians clustering, Fuzzy clustering, Pattern recognition, Correlation clustering, Computer science, Artificial intelligence, Clustering coefficient, Cluster analysis, Geodesic, Distance measures
DocType
Conference
ISBN
978-1-4244-9926-7
Citations
0
PageRank
0.34
References
8
Authors
3
Name
Order
Citations
PageRank
Selma Tekir101.35
Florian Mansmann258935.91
Daniel A. Keim377041141.60