Title
GenStore: In-Storage Filtering of Genomic Data for High-Performance and Energy-Efficient Genome Analysis
Abstract
Genome sequence analysis, which analyzes the DNA sequences of organisms, is important for many applications in personalized medicine [1]–[8], outbreak tracing [9]–[14], and evolutionary studies [15]–[21]. The information in an organism's DNA is converted to digital data via a process called sequencing. A sequencing machine extracts the sequences of DNA molecules from the organism's sample in the form of strings consisting of four types of base pairs (bps), denoted by A, C, G, and T. No current sequencing technology can read a human DNA molecule in its entirety. Instead, state-of-the-art sequencing machines generate randomly sampled, inexact sub-strings of the original genome, called reads. In most technologies, the information about the location of each read in the complete genome is lost during sequencing. State-of-the-art sequencing machines produce one of two kinds of reads. 1) Short read sequencing technologies, such as Illumina [22], [23], produce reads that are highly accurate (99-99.9%) [24]–[26], but short (e.g., up to a few hundred DNA base pairs [24], [27], [28]). 2) Long read sequencing technologies, such as Pacific Biosciences (PacBio) [29] and Oxford Nanopore Technologies (ONT) [30], produce reads that are less accurate (85-90%) [27], [31]–[33], but long (e.g., lengths ranging from thousands to millions of base pairs [34]).
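The description of reads as randomly sampled, inexact sub-strings can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the paper: the function name `sample_reads`, the synthetic 1000-bp genome, and the chosen read lengths and error rates are all assumptions, and it models substitution errors only (real sequencers also produce insertions and deletions).

```python
import random

BASES = "ACGT"

def sample_reads(genome, num_reads, read_len, error_rate, seed=0):
    """Draw reads from random positions and corrupt each base with
    probability error_rate (substitutions only; a simplifying assumption)."""
    rng = random.Random(seed)
    reads = []
    for _ in range(num_reads):
        # The start position is chosen at random and then discarded,
        # mirroring the loss of location information during sequencing.
        start = rng.randrange(len(genome) - read_len + 1)
        read = []
        for base in genome[start:start + read_len]:
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                base = rng.choice([b for b in BASES if b != base])
            read.append(base)
        reads.append("".join(read))
    return reads

# Hypothetical toy genome; real genomes are many orders of magnitude longer.
genome_rng = random.Random(42)
genome = "".join(genome_rng.choice(BASES) for _ in range(1000))

# Short-read-like setting: short and highly accurate (about 99.9%).
short_reads = sample_reads(genome, num_reads=5, read_len=100, error_rate=0.001)

# Long-read-like setting: longer but less accurate (about 90%).
long_like_reads = sample_reads(genome, num_reads=2, read_len=800, error_rate=0.10)
```

Because the reads carry no position, a later step has to recover where each read came from in the genome, which is why error rate and read length matter so much for the analysis.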
Year
2022
DOI
10.1109/ISVLSI54635.2022.00062
Venue
2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)
Keywords
Near Data Processing, Read Mapping, Filtering, Genomics, Storage
DocType
Conference
ISSN
2159-3469
ISBN
978-1-6654-6606-6
Citations
0
PageRank
0.34
References
44
Authors
14