Title
Parallel Clustering of Single Cell Transcriptomic Data with Split-Merge Sampling on Dirichlet Process Mixtures.
Abstract
Motivation: With the development of droplet-based systems, massive single-cell transcriptome data have become available, enabling analysis of cellular and molecular processes at single-cell resolution, which is instrumental to understanding many biological processes. While state-of-the-art clustering methods have been applied to these data, they face three challenges: (i) clustering quality still needs improvement; (ii) most models require prior knowledge of the number of clusters, which is not always available; and (iii) there is demand for faster computation.
Results: We propose to tackle these challenges with Parallelized Split Merge Sampling on a Dirichlet Process Mixture Model (the Para-DPMM model). Unlike classic DPMM methods, which sample each data point individually, the split-merge mechanism samples at the cluster level, which significantly improves convergence and the optimality of the result. The model is highly parallelized and can use the computing power of high performance computing (HPC) clusters, enabling inference on very large datasets. Experimental results show that the model outperforms widely used current models in both clustering quality and computational speed.
Availability and implementation: Source code is publicly available at https://github.com/tiehangd/Para_DPMM/tree/master/Para_DPMM_package.
Supplementary information: Supplementary data are available at Bioinformatics online.
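The abstract notes that most clustering models need the number of clusters in advance, whereas a Dirichlet process mixture does not. A minimal sketch of why: the partition prior behind a DPMM is the Chinese restaurant process, in which each point joins an existing cluster with probability proportional to that cluster's size or opens a new cluster with probability proportional to a concentration parameter. The function name `crp_partition` and the parameter `alpha` are illustrative assumptions; this is not the Para-DPMM implementation and it does not include the split-merge moves or any likelihood term.

```python
import random


def crp_partition(n_points, alpha, seed=0):
    """Draw cluster labels for n_points from a Chinese restaurant process.

    Illustrative only: this samples from the prior over partitions and
    ignores the data, so it shows how the number of clusters can grow
    without being fixed in advance. alpha must be positive.
    """
    rng = random.Random(seed)
    labels = []  # labels[i] = cluster index assigned to point i
    sizes = []   # sizes[k] = number of points currently in cluster k
    for _ in range(n_points):
        # Existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, sizes


if __name__ == "__main__":
    labels, sizes = crp_partition(n_points=1000, alpha=1.0)
    print("number of clusters:", len(sizes))
```

Larger values of `alpha` tend to produce more clusters. In a full DPMM sampler these prior weights would be combined with each cluster's likelihood for the data point, and a split-merge scheme would additionally propose moves that reassign whole groups of points at once.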
Year: 2018
DOI: 10.1093/bioinformatics/bty702
Venue: BIOINFORMATICS
Field: Convergence (routing), Data mining, Cluster (physics), Dirichlet process, Supercomputer, Source code, Computer science, Sampling (statistics), Merge (version control), Cluster analysis
DocType: Journal
Volume: 35
Issue: 6
ISSN: 1367-4803
Citations: 1
PageRank: 0.40
References: 4
Authors: 3
Name | Order | Citations | PageRank
Tiehang Duan | 1 | 1 | 0.40
José Pinto | 2 | 23 | 6.89
Xiaohui Xie | 3 | 61 | 5.50