Title
ParLECH: Parallel Long-Read Error Correction with Hadoop
Abstract
Long-read sequencing is emerging as a promising sequencing technology because it overcomes the short read length limitation of second-generation sequencing, which has dominated the sequencing market in past years. However, long reads have substantially higher error rates than short reads (e.g., 13% vs. 0.1%), and their sequencing cost per base is typically higher as well. To address these limitations, we present ParLECH, a distributed hybrid error correction framework that is scalable and cost-efficient for PacBio long reads. To correct errors in the long reads, ParLECH uses Illumina short reads, which offer a low error rate and high coverage at low cost. To analyze the high-throughput Illumina short reads efficiently, ParLECH is built on Hadoop and a distributed NoSQL system. To further improve accuracy, ParLECH uses the k-mer coverage information of the Illumina short reads. Specifically, we develop a distributed version of the widest path algorithm, which maximizes the minimum k-mer coverage along a path in the de Bruijn graph constructed from the Illumina short reads. We then replace each error region in a long read with its corresponding widest path. Our experimental results show that ParLECH handles large-scale real-world datasets in a scalable and accurate manner. Using ParLECH, we process a 312 GB human genome PacBio dataset, together with a 452 GB Illumina dataset, on 128 nodes in less than 29 hours.
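The widest path idea mentioned in the abstract can be illustrated with a minimal single-machine sketch. This is not the distributed Hadoop implementation the abstract describes. It is a toy version that assumes the short reads fit in memory, that edges of the de Bruijn graph are k-mers weighted by their occurrence count, and that the two (k-1)-mers flanking an error region are already known. The function names `build_de_bruijn`, `widest_path`, and `spell` are hypothetical and chosen only for illustration.

```python
import heapq
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph from short reads (illustrative helper).

    Nodes are (k-1)-mers. Each edge is a k-mer, weighted by the number of
    times that k-mer occurs in the reads (its coverage).
    """
    coverage = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            coverage[read[i:i + k]] += 1
    graph = defaultdict(dict)
    for kmer, count in coverage.items():
        graph[kmer[:-1]][kmer[1:]] = count
    return graph


def widest_path(graph, source, target):
    """Return (bottleneck, path) maximizing the minimum edge coverage.

    Dijkstra-style search using a max-heap keyed on the bottleneck
    coverage seen so far. Returns (0, []) if target is unreachable.
    """
    best = {source: float("inf")}
    parent = {}
    heap = [(-float("inf"), source)]
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue  # stale heap entry, a wider route was already found
        if node == target:
            break
        for nxt, cov in graph.get(node, {}).items():
            cand = min(width, cov)  # bottleneck if we extend through this edge
            if cand > best.get(nxt, 0):
                best[nxt] = cand
                parent[nxt] = node
                heapq.heappush(heap, (-cand, nxt))
    if target not in best:
        return 0, []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    path.reverse()
    return best[target], path


def spell(path):
    """Convert a path of overlapping (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy data: five reads support ACGTA, one erroneous read supports ACCTA.
reads = ["ACGTA"] * 5 + ["ACCTA"]
graph = build_de_bruijn(reads, k=3)
bottleneck, path = widest_path(graph, "AC", "TA")
corrected = spell(path)
```

In this toy input the route through the well-covered k-mers has a minimum coverage of 5, while the alternative route has a minimum coverage of 1, so the search selects the former and `corrected` spells out `ACGTA`.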
Year: 2018
DOI: 10.1109/BIBM.2018.8621549
Venue: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Keywords: ParLECH, parallel long-read error correction, short length limitation, second-generation sequencing, sequencing market, sequencing cost, distributed hybrid error correction framework, Illumina short reads, low error rate, sequencing technology, k-mer coverage information, distributed NoSQL system, widest path algorithm, human genome PacBio dataset, long-read sequencing
Field: Computer science, Parallel computing, Word error rate, Error detection and correction, NoSQL, De Bruijn graph, Artificial intelligence, Machine learning, Scalability
DocType: Conference
ISSN: 2156-1125
ISBN: 978-1-5386-5489-7
Citations: 0
PageRank: 0.34
References: 0
Authors: 3
Name | Order | Citations | PageRank
Arghya Kusum Das | 1 | 6 | 2.18
Kisung Lee | 2 | 342 | 27.05
Seung-Jong Park | 3 | 319 | 31.12