Title
DataLab: a version data management and analytics system.
Abstract
One challenge in big data analytics is the lack of tools to manage the complex interactions among code, data, and parameters, especially in the common situation where all of these change frequently. We present our preliminary experience with DataLab, a system we built to manage big data workflows. DataLab improves the big data analytics workflow in several novel ways: 1) DataLab manages revisions of both code and data in a single coherent system, and includes a distributed code execution engine to run users' code; 2) DataLab keeps track of all data analytics results in a data workflow graph and can compare the code and results between any two versions, making it easier for users to see the effects of their code changes; 3) DataLab provides an efficient data management system that separates data from its metadata, allowing efficient preprocessing filters; and 4) DataLab provides a common API so that people can build different applications on top of it. We also present our experience applying a DataLab prototype to a real bioinformatics application.
Year: 2016
DOI: 10.1145/2896825.296830
Venue: BIGDSE@ICSE
Keywords: software engineering, data management, data analytics, version control
Field: Metadata, Software analytics, Data analysis, Computer science, Distributed database, Analytics, Data management, Workflow, Big data, Database
DocType: Conference
ISBN: 978-1-4503-4152-3
Citations: 0
PageRank: 0.34
References: 12
Authors: 6
Name          Order  Citations  PageRank
Yang Zhang    1      189        43.34
Fangzhou Xu   2      0          0.68
Erwin Frise   3      16         1.14
Siqi Wu       4      19         8.73
Bin Yu        5      1984       241.03
Wei Xu        6      656        41.71