Graph sketching-based Massive Data Clustering. - Citegraph

Paper Info

Title
Graph sketching-based Massive Data Clustering.

Abstract
In this paper, we address the problem of recovering arbitrary-shaped data clusters from massive datasets. We present DBMSTClu a new density-based non-parametric method working on a limited number of linear measurements i.e. a sketched version of the similarity graph $G$ between the $N$ objects to cluster. Unlike $k$-means, $k$-medians or $k$-medoids algorithms, it does not fail at distinguishing clusters with particular structures. No input parameter is needed contrarily to DBSCAN or the Spectral Clustering method. DBMSTClu as a graph-based technique relies on the similarity graph $G$ which costs theoretically $O(N^2)$ in memory. However, our algorithm follows the dynamic semi-streaming model by handling $G$ as a stream of edge weight updates and sketches it in one pass over the data into a compact structure requiring $O(N operatorname{poly} operatorname{log} (N))$ space. Thanks to the property of the Minimum Spanning Tree (MST) for expressing the underlying structure of a graph, our algorithm successfully detects the right number of non-convex clusters by recovering an approximate MST from the graph sketch of $G$. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state-of-the-art on several datasets.

Year	Venue	Field
2017	arXiv: Learning	Strength of a graph,Combinatorics,Line graph,Graph property,Graph factorization,Null graph,Artificial intelligence,Graph bandwidth,Clique-width,Butterfly graph,Machine learning,Mathematics
DocType	Volume	Citations
Journal	abs/1703.02375	2
PageRank	References	Authors
0.37	0	4

Authors (4 rows)

Cited by (2 rows)

References (0 rows)

Name	Order	Citations	PageRank
Anne Morvan	1	3	1.42
Krzysztof Choromanski	2	124	23.56
Cédric Gouy-Pailler	3	62	10.69
Jamal Atif	4	309	29.49

1