ParLECH: Parallel Long-Read Error Correction with Hadoop

Paper Info

Title
ParLECH: Parallel Long-Read Error Correction with Hadoop

Abstract
Long-read sequencing is emerging as a promising sequencing technology because it overcomes the short read length limitation of second-generation sequencing, which has dominated the sequencing market in past years. However, it has substantially higher error rates than short-read sequencing (e.g., 13% vs. 0.1%), and its sequencing cost per base is typically higher as well. To address these limitations, we present a distributed hybrid error correction framework, called ParLECH, that is scalable and cost-efficient for PacBio long reads. To correct the errors in the long reads, ParLECH uses Illumina short reads, which offer a low error rate and high coverage at low cost. To analyze the high-throughput Illumina short reads efficiently, ParLECH is built on Hadoop and a distributed NoSQL system. To further improve accuracy, ParLECH uses the k-mer coverage information of the Illumina short reads. Specifically, we develop a distributed version of the widest path algorithm, which maximizes the minimum k-mer coverage along a path in the de Bruijn graph constructed from the Illumina short reads. We replace each error region in a long read with its corresponding widest path. Our experimental results show that ParLECH can handle large-scale real-world datasets in a scalable and accurate manner. Using ParLECH, we can process a 312 GB human genome PacBio dataset, together with a 452 GB Illumina dataset, on 128 nodes in less than 29 hours.
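
The widest path idea mentioned in the abstract can be illustrated with a small single-machine sketch. This is not the ParLECH implementation, which is distributed over Hadoop and a NoSQL store; it is a minimal in-memory example under several assumptions: k-mers are treated as graph vertices, the coverage of a k-mer is its occurrence count in the short reads, and the function names (`kmer_counts`, `widest_path`, `spell`) are hypothetical. The search is a Dijkstra-style maximin traversal that keeps, for each k-mer, the largest bottleneck coverage found so far.

```python
import heapq
from collections import Counter


def kmer_counts(reads, k):
    """Count how often each k-mer occurs in the short reads (its coverage)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def widest_path(counts, source, target):
    """Return (bottleneck, path) maximizing the minimum k-mer coverage.

    Assumption: vertices are k-mers, and an edge u -> v exists when v is an
    observed k-mer whose first k-1 bases equal the last k-1 bases of u.
    """
    if source not in counts or target not in counts:
        return 0, []
    best = {source: counts[source]}   # best bottleneck known per k-mer
    parent = {source: None}
    heap = [(-counts[source], source)]  # max-heap via negated widths
    while heap:
        neg_width, u = heapq.heappop(heap)
        width = -neg_width
        if width < best[u]:
            continue  # stale heap entry
        if u == target:
            break
        for base in "ACGT":
            v = u[1:] + base
            if v not in counts:
                continue
            w = min(width, counts[v])  # bottleneck after stepping to v
            if w > best.get(v, 0):
                best[v] = w
                parent[v] = u
                heapq.heappush(heap, (-w, v))
    if target not in best:
        return 0, []
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return best[target], path


def spell(path):
    """Spell the sequence represented by a path of overlapping k-mers."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:]) if path else ""


# Toy data: three reads support ACGTCA, one read supports ACGATCA.
reads = ["ACGTCA", "ACGTCA", "ACGTCA", "ACGATCA"]
counts = kmer_counts(reads, k=3)
width, path = widest_path(counts, "ACG", "TCA")
print(width, spell(path))
```

In this toy input the branch through the k-mers seen only once has a bottleneck of 1, so the search prefers the branch supported by three reads. In a hybrid correction setting, the source and target would be k-mers anchoring the two ends of an error region in a long read, and the spelled path would replace that region.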

Year	DOI	Venue
2018	10.1109/BIBM.2018.8621549	2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Keywords	Field	DocType
ParLECH,parallel long-read error correction,short length limitation,second-generation sequencing,sequencing market,sequencing cost,distributed hybrid error correction framework,Illumina short reads,low error rate,sequencing technology,k-mer coverage information,distributed NoSQL system,widest path algorithm,human genome PacBio dataset,long-read sequencing	Computer science,Parallel computing,Word error rate,Error detection and correction,NoSQL,De Bruijn graph,Artificial intelligence,Machine learning,Scalability	Conference
ISSN	ISBN	Citations
2156-1125	978-1-5386-5489-7	0
PageRank	References	Authors
0.34	0	3

Authors (3 rows)

Name	Order	Citations	PageRank
Arghya Kusum Das	1	6	2.18
Kisung Lee	2	342	27.05
Seung-Jong Park	3	319	31.12

Cited by (0 rows)

References (0 rows)
