Title
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.
Abstract
Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts.
Year
2014
DOI
10.1093/bioinformatics/btt601
Venue
BIOINFORMATICS
Keywords
software design
Field
Data mining, Data processing, Data set, Software design, Computer science, MIT License, Bioinformatics, Java, Database, Scalability, Scripting language
DocType
Journal
Volume
30
Issue
1
ISSN
1367-4803
Citations
33
PageRank
1.45
References
7
Authors
7
Name                Order  Citations  PageRank
André Schumacher    1      71         7.26
Luca Pireddu        2      100        10.01
Matti Niemenmaa     3      65         3.91
Aleksi Kallio       4      85         5.75
Eija Korpelainen    5      103        8.95
Gianluigi Zanetti   6      208        29.13
Keijo Heljanko      7      751        47.90