Title
High Dimensional Data Clustering By Means Of Distributed Dirichlet Process Mixture Models
Abstract
Clustering is a data mining technique used intensively in data analytics, with applications in marketing, security, text and document analysis, and sciences such as biology and astronomy, among many others. The Dirichlet Process Mixture (DPM) is a model for multivariate clustering that has the advantage of discovering the number of clusters automatically, along with other favorable characteristics. However, applying it to high dimensional data is a major challenge, with numerical and theoretical pitfalls. The advantages of DPM come at the price of prohibitive running times, which impair its adoption and make centralized DPM approaches inefficient, especially with high dimensional data. We propose HD4C (High Dimensional Data Distributed Dirichlet Clustering), a parallel clustering solution that addresses the curse of dimensionality in two ways. First, it scales gracefully to massive datasets through distributed computing while remaining DPM-compliant. Second, it clusters high dimensional data such as time series (as a function of time) and hyperspectral data (as a function of wavelength). Our experiments on both synthetic and real-world data illustrate the high performance of our approach.
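The claim that a DPM discovers the number of clusters automatically can be illustrated with the Chinese restaurant process, the partition prior induced by a Dirichlet process. The sketch below is not the HD4C algorithm and performs no inference on data; it only samples from the prior to show that the number of clusters is an outcome rather than an input. The function name `sample_crp_partition` and the parameters `alpha` and `seed` are hypothetical names chosen for this illustration.

```python
import random


def sample_crp_partition(n_points, alpha, seed=0):
    """Sample cluster labels for n_points from a Chinese restaurant process.

    alpha is the concentration parameter: larger values make new clusters
    more likely. This is an illustrative prior sample only, not HD4C.
    """
    rng = random.Random(seed)
    cluster_sizes = []  # cluster_sizes[k] = number of points in cluster k
    assignments = []    # assignments[i] = cluster label of point i

    for _ in range(n_points):
        # An existing cluster is chosen with weight equal to its size;
        # a brand-new cluster is chosen with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[k] += 1    # join an existing cluster
        assignments.append(k)

    return assignments, cluster_sizes


if __name__ == "__main__":
    for alpha in (0.5, 5.0):
        _, sizes = sample_crp_partition(1000, alpha, seed=42)
        print(f"alpha={alpha}: {len(sizes)} clusters")
```

In a full DPM, each cluster would also carry component parameters (for example a Gaussian mean and covariance), and the assignments would be resampled conditionally on the observed data; that part is omitted here.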
Year
2019
DOI
10.1109/BigData47090.2019.9006065
Venue
2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA)
Keywords
Gaussian random process, Dirichlet Process Mixture Model, Clustering, Parallelism, Reproducing Kernel Hilbert Space
Field
Data mining, Cluster (physics), Clustering high-dimensional data, Data analysis, Computer science, Curse of dimensionality, Hyperspectral imaging, Dirichlet distribution, Cluster analysis, Reproducing kernel Hilbert space
DocType
Conference
ISSN
2639-1589
Citations
0
PageRank
0.34
References
0
Authors
4
Name                 Order  Citations  PageRank
Khadidja Meguelati   1      0          0.34
Benedicte Fontez     2      0          0.34
Nadine Hilgert       3      0          0.34
Florent Masseglia    4      408        43.08