Title
CloudRS: An error correction algorithm of high-throughput sequencing data based on scalable framework
Abstract
Next-generation sequencing (NGS) technologies produce huge amounts of data. These data are unavoidably accompanied by sequencing errors, which constitute one of the major problems for downstream analyses. As illustrated in the literature, error correction is a critical step for the success of NGS applications such as de novo genome assembly and DNA resequencing. However, it is demanding in both computing time and memory space. Designing an algorithm that improves data quality by efficiently using on-demand computing resources in the cloud is a challenge for biologists and computer scientists. In this study, we present an error-correction algorithm, called CloudRS, for correcting errors in NGS data. CloudRS aims to emulate the concept behind the error correction algorithm of ALLPATHS-LG on the Hadoop/MapReduce framework. It is conservative in correcting sequencing errors so as to avoid introducing false decisions, e.g., when dealing with reads from repetitive regions. We also describe several probabilistic measures that we introduce into CloudRS to make the algorithm more efficient without sacrificing its effectiveness. Running times measured on up to 80 instances, each with 8 computing units, show satisfactory speedup. Experimental comparisons with other error correction programs show that CloudRS achieves a lower false positive rate on most evaluation benchmarks and higher sensitivity on the S. cerevisiae genome. We demonstrate that CloudRS significantly improves the quality of the resulting contigs on NGS de novo assembly benchmarks.
Year
2013
DOI
10.1109/BigData.2013.6691642
Venue
BigData Conference
Keywords
next-generation sequencing,error correction algorithm,mapreduce,allpaths-lg,cloudrs,biologists,scalable framework,hadoop/mapreduce framework,data quality,biology computing,computer scientists,high-throughput sequencing data,cloud computing,genome assembly,on-demand computing resources,error correction,ngs technologies
Field
False positive rate,Data mining,Data quality,Computer science,Algorithm,Error detection and correction,Probabilistic logic,Sequence assembly,Speedup,Cloud computing,Scalability
DocType
Conference
ISSN
2639-1589
Citations
4
PageRank
0.40
References
10
Authors
5
Name, Order, Citations, PageRank
Chien-Chih Chen111120.42
Yu-Jung Chang211912.09
Wei-Chun Chung362.79
D.T. Lee462778.14
Jan-Ming Ho5950106.64