DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data. - Citegraph

Paper Info

Title
DRUMS: Disk Repository with Update Management and Select option for high throughput sequencing data.

Abstract
New technologies for analyzing biological samples, like next generation sequencing, are producing a growing amount of data together with quality scores. Moreover, software tools (e.g., for mapping sequence reads, calculating transcription factor binding probabilities, estimating epigenetic modification enriched regions, or determining single nucleotide polymorphisms) increase this amount of position-specific DNA-related data even further. Hence, requesting data becomes challenging and expensive and is often implemented using specialised hardware. In addition, picking specific data as fast as possible becomes increasingly important in many fields of science. The general problem of handling big data sets was addressed by developing specialized databases like HBase, HyperTable or Cassandra. However, these database solutions also require specialized or distributed hardware, leading to expensive investments. To the best of our knowledge, there is no database capable of (i) storing billions of position-specific DNA-related records, (ii) performing fast and resource-saving requests, and (iii) running on a single standard computer.
Here, we present DRUMS (Disk Repository with Update Management and Select option), satisfying demands (i)-(iii). It tackles the weaknesses of traditional databases while handling position-specific DNA-related data in an efficient manner. DRUMS is capable of storing up to billions of records. Moreover, it focuses on optimizing related single lookups as range requests, which are needed constantly for computations in bioinformatics. To validate the power of DRUMS, we compare it to the widely used MySQL database. The test setting considers two biological data sets. We use standard desktop hardware as the test environment.
DRUMS outperforms MySQL in writing and reading records by a factor of two up to a factor of 10000. Furthermore, it can work with significantly larger data sets.
Our work focuses on mid-sized data sets up to several billion records without requiring cluster technology. Storing position-specific data is a general problem and the concept we present here is a generalized approach. Hence, it can be easily applied to other fields of bioinformatics.
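The idea of answering related single lookups as one range request can be illustrated with a small sketch. This is not the DRUMS implementation (DRUMS is disk-based, and the abstract does not describe its internals); it is a minimal in-memory example that assumes records are kept sorted by a (chromosome, position) key. The names `records`, `range_select` and `single_lookups` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory stand-in for a position-keyed store.
# Each record is (chromosome, position, value), sorted by (chromosome, position).
records = [
    (1, 100, 0.12),
    (1, 105, 0.80),
    (1, 230, 0.33),
    (2, 17, 0.51),
    (2, 90, 0.07),
]
# Separate key list so that binary search compares keys only.
keys = [(chrom, pos) for chrom, pos, _ in records]


def range_select(chrom, start, end):
    """Return all records on `chrom` with start <= position <= end."""
    lo = bisect_left(keys, (chrom, start))
    hi = bisect_right(keys, (chrom, end))
    return records[lo:hi]


def single_lookups(chrom, positions):
    """Answer several nearby lookups with one range scan instead of one search each."""
    if not positions:
        return []
    wanted = set(positions)
    hits = range_select(chrom, min(positions), max(positions))
    return [rec for rec in hits if rec[1] in wanted]
```

For example, `range_select(1, 100, 200)` returns the two records at positions 100 and 105 on chromosome 1. Whether a single range scan beats separate searches depends on how close together the requested positions are, so the sketch only conveys the access pattern, not a performance claim.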

Year	DOI	Venue
2014	10.1186/1471-2105-15-38	BMC Bioinformatics
Keywords	Field	DocType
database management systems,bioinformatics,algorithms,microarrays,endogenous retroviruses,genomics	Computer science,Emerging technologies,Software,DNA sequencing,Genome human,Bioinformatics,Genetics,Big data,Database	Journal
Volume	Issue	ISSN
15	1	1471-2105
Citations	PageRank	References
4	0.34	2
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Martin Nettling	1	10	3.02
Nils Thieme	2	20	0.98
Andreas Both	3	368	30.03
Ivo Grosse	4	404	37.14
