Title
Using geometric structures to improve the error correction algorithm of high-throughput sequencing data on MapReduce framework
Abstract
Next-generation sequencing (NGS) data are a rapidly growing example of big data and a source of new knowledge in science. However, sequencing errors remain unavoidable and reduce the quality of NGS data. Error correction is therefore a critical step in applications of NGS data such as de novo genome assembly and DNA resequencing. Since NGS throughput doubles approximately every five months and the length of NGS records (i.e., reads) is increasing, computational strategies need to become both more efficient and more effective. In this study, we aim to improve the performance of CloudRS, an open-source MapReduce application designed to correct sequencing errors in NGS data. We introduce the readmessage (RM) diagram to represent the set of messages, i.e., the key-value pairs generated on each read. We also present the Gradient-number Votes (GNV) scheme, which trims off portions of the RM diagram and thereby reduces the total size of the messages associated with each read. Experimental results show that the GNV scheme reduces execution time and improves the quality of the de novo genome assembly.
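To make the notion of per-read messages concrete, the following is a minimal sketch of how a map step might emit key-value pairs for each read and how a shuffle step might group them by key. It is an illustration only: the choice of k-mers as keys, the (read id, offset) values, and the names `map_read` and `shuffle` are assumptions made here, not CloudRS's actual message format, and the sketch does not attempt to reproduce the RM diagram or the GNV scheme.

```python
from collections import defaultdict


def map_read(read_id, seq, k):
    """Emit one (k-mer, (read_id, offset)) message per window of the read.

    Hypothetical message layout, chosen only for illustration.
    """
    for offset in range(len(seq) - k + 1):
        yield seq[offset:offset + k], (read_id, offset)


def shuffle(messages):
    """Group message values by key, as a MapReduce shuffle phase would."""
    groups = defaultdict(list)
    for key, value in messages:
        groups[key].append(value)
    return dict(groups)


# Two toy reads that overlap by five bases.
reads = {"r1": "ACGTAC", "r2": "CGTACG"}
k = 4

messages = [m for rid, seq in reads.items() for m in map_read(rid, seq, k)]
groups = shuffle(messages)

for kmer in sorted(groups):
    print(kmer, groups[kmer])
```

In this sketch a read of length L yields L - k + 1 messages, so the message volume grows with both the number of reads and the read length. This is the kind of growth that makes reducing the total size of per-read messages worthwhile.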
Year
2014
DOI
10.1109/BigData.2014.7004306
Venue
BigData Conference
Keywords
mapreduce framework, geometric structures, ngs data, diagrams, next-generation sequencing, big data, next-generation sequencing data, error correction algorithm, mapreduce, readmessage diagram, genetics, cloudrs, geometric structure, gradient-number votes, gnv, rm diagram, bioinformatics, error correction, next generation sequencing
Field
Data mining, Trim, Computer science, Error detection and correction, Execution time, DNA sequencing, Throughput, Big data, DNA Resequencing, Sequence assembly
DocType
Conference
ISSN
2639-1589
Citations
0
PageRank
0.34
References
18
Authors
4
Name            Order  Citations  PageRank
Wei-Chun Chung  1      6          2.79
Yu-Jung Chang   2      119        12.09
D.T. Lee        3      627        78.14
Jan-Ming Ho     4      950        106.64