Title
Clustering by compression
Abstract
We present a new method for clustering based on compression. The method does not use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the noncomputable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (ternary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics, we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.
Year
DOI
Venue
2003
10.1109/TIT.2005.844059
IEEE Transactions on Information Theory
Keywords
DocType
Volume
application area boundary,new quartet method,distance matrix,normalized information distance,hierarchical clustering method,specific application area,new method,new evidence,similarity distance,normalized compression distance,data mining,robustness,mitochondrial genome,mammalian evolution,data compression,sorting,binary tree,hierarchical clustering,astronomy,compressors,dendrogram,genetics,bioinformatics,data analysis,application software,dictionaries,genomics
Journal
51
Issue
ISSN
Citations 
4
0018-9448
288
PageRank 
References 
Authors
16.03
14
2
Search Limit
100288
Name
Order
Citations
PageRank
R. Cilibrasi128816.03
Paul Vitányi22130287.76