Abstract | ||
---|---|---|
Long-read sequencing is emerging as a promising sequencing technology because it overcomes the short read-length limitation of second-generation sequencing, which has dominated the sequencing market in past years. However, long reads have substantially higher error rates than short reads (e.g., 13% vs. 0.1%), and their sequencing cost per base is typically higher. To address these limitations, we present ParLECH, a distributed hybrid error correction framework for PacBio long reads that is scalable and cost-efficient. To correct errors in the long reads, ParLECH uses Illumina short reads, which offer a low error rate and high coverage at low cost. To analyze the high-throughput Illumina short reads efficiently, ParLECH is equipped with Hadoop and a distributed NoSQL system. To further improve accuracy, ParLECH uses the k-mer coverage information of the Illumina short reads. Specifically, we develop a distributed version of the widest path algorithm, which maximizes the minimum k-mer coverage along a path in the de Bruijn graph constructed from the Illumina short reads, and we replace each error region in a long read with its corresponding widest path. Our experimental results show that ParLECH handles large-scale real-world datasets in a scalable and accurate manner: it processes a 312 GB human genome PacBio dataset, together with a 452 GB Illumina dataset, on 128 nodes in less than 29 hours. |
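
The widest path idea in the abstract can be illustrated with a small single-machine sketch. This is not ParLECH's distributed Hadoop implementation; it only shows the underlying technique under stated assumptions: nodes of the de Bruijn graph are k-mers, each node's weight is its count in the short reads, and the width of a path is the smallest count on it. The function names (`count_kmers`, `build_graph`, `widest_path`, `spell`) and the toy reads are hypothetical and chosen for illustration.

```python
import heapq
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer occurring in the short reads (its coverage)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_graph(counts):
    """Node-centric de Bruijn graph: edge a -> b when a[1:] == b[:-1]."""
    by_prefix = defaultdict(list)
    for kmer in counts:
        by_prefix[kmer[:-1]].append(kmer)
    return {kmer: by_prefix.get(kmer[1:], []) for kmer in counts}


def widest_path(counts, graph, source, target):
    """Dijkstra variant that maximizes the minimum k-mer coverage on a path.

    Returns (path, width), or (None, 0) if target is unreachable.
    """
    if source not in counts or target not in counts:
        return None, 0
    best = {source: counts[source]}   # best known width per node
    parent = {source: None}
    heap = [(-counts[source], source)]  # negate widths: heapq is a min-heap
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best[node]:
            continue  # stale heap entry
        if node == target:
            break
        for nxt in graph[node]:
            new_width = min(width, counts[nxt])
            if new_width > best.get(nxt, 0):
                best[nxt] = new_width
                parent[nxt] = node
                heapq.heappush(heap, (-new_width, nxt))
    if target not in best:
        return None, 0
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path, best[target]


def spell(path):
    """Turn a path of overlapping k-mers back into a sequence."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


# Toy data (assumed): three error-free reads and one read with a C->A error.
reads = ["CATGCCGTA"] * 3 + ["CATGACGTA"]
counts = count_kmers(reads, 4)
graph = build_graph(counts)
path, width = widest_path(counts, graph, "CATG", "CGTA")
print(spell(path), width)  # CATGCCGTA 3
```

In this toy case the low-coverage branch created by the erroneous read has width 1, so the search prefers the branch with width 3. In a correction setting, the source and target k-mers would be the anchors flanking an error region of a long read, and the spelled-out path would replace that region.
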
Year | DOI | Venue |
---|---|---|
2018 | 10.1109/BIBM.2018.8621549 | 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) |
Keywords | Field | DocType |
ParLECH,parallel long-read error correction,short length limitation,second-generation sequencing,sequencing market,sequencing cost,distributed hybrid error correction framework,Illumina short reads,low error rate,sequencing technology,k-mer coverage information,distributed NoSQL system,widest path algorithm,human genome PacBio dataset,long-read sequencing | Computer science,Parallel computing,Word error rate,Error detection and correction,NoSQL,De Bruijn graph,Artificial intelligence,Machine learning,Scalability | Conference |
ISSN | ISBN | Citations |
2156-1125 | 978-1-5386-5489-7 | 0 |
PageRank | References | Authors |
0.34 | 0 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Arghya Kusum Das | 1 | 6 | 2.18 |
Kisung Lee | 2 | 342 | 27.05 |
Seung-Jong Park | 3 | 319 | 31.12 |