Title
FlashR: parallelize and scale R for machine learning using SSDs.
Abstract
R is one of the most popular programming languages for statistics and machine learning, but it is slow and unable to scale to large datasets. The usual way to obtain an efficient algorithm in R is to implement it in C or FORTRAN and provide an R wrapper. FlashR accelerates and scales existing R code by parallelizing a large number of matrix functions in the R base package and scaling them beyond memory capacity with solid-state drives (SSDs). FlashR performs memory-hierarchy-aware execution to speed up parallelized R code by (i) evaluating matrix operations lazily, (ii) performing all operations in a DAG in a single execution and with only one pass over the data to increase the ratio of computation to I/O, and (iii) performing two levels of matrix partitioning and reordering computation on matrix partitions to reduce data movement in the memory hierarchy. We evaluate FlashR on various machine learning and statistics algorithms on inputs of up to four billion data points. Despite the huge performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms. The R implementations in FlashR outperform H2O and Spark MLlib by a factor of 3 to 20.
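Points (i) and (ii) of the abstract can be illustrated with a small Python sketch. It is not FlashR's implementation or API: the names `LazyVec`, `Source`, and `lazy_sum` are hypothetical, an in-memory list stands in for data on SSDs, and only element-wise operations followed by a sum are modeled. Operators build a DAG without touching data, and a single evaluation streams over partitions, reading each partition of the source once even when the source appears several times in the expression.

```python
from typing import Callable, Dict, List, Optional


class LazyVec:
    """Hypothetical DAG node for an element-wise operation; nothing runs on creation."""

    def __init__(self, op: Optional[Callable[..., float]], parents: List["LazyVec"]):
        self.op = op
        self.parents = parents

    def __add__(self, other: "LazyVec") -> "LazyVec":
        return LazyVec(lambda a, b: a + b, [self, other])

    def __mul__(self, other: "LazyVec") -> "LazyVec":
        return LazyVec(lambda a, b: a * b, [self, other])

    def compute_chunk(self, start: int, stop: int, cache: Dict[int, List[float]]) -> List[float]:
        # Evaluate this node on one partition, reusing results already
        # computed for the same partition by other parts of the DAG.
        key = id(self)
        if key not in cache:
            inputs = [p.compute_chunk(start, stop, cache) for p in self.parents]
            cache[key] = [self.op(*vals) for vals in zip(*inputs)]
        return cache[key]


class Source(LazyVec):
    """Leaf node; the list is an assumed stand-in for data stored on slow storage."""

    def __init__(self, data: List[float]):
        super().__init__(op=None, parents=[])
        self.data = data
        self.chunks_read = 0  # counts simulated partition reads

    def compute_chunk(self, start: int, stop: int, cache: Dict[int, List[float]]) -> List[float]:
        key = id(self)
        if key not in cache:
            self.chunks_read += 1
            cache[key] = self.data[start:stop]
        return cache[key]


def lazy_sum(node: LazyVec, length: int, chunk_size: int) -> float:
    """Evaluate the whole DAG partition by partition and sum the result."""
    total = 0.0
    for start in range(0, length, chunk_size):
        stop = min(start + chunk_size, length)
        cache: Dict[int, List[float]] = {}  # fresh cache for each partition
        total += sum(node.compute_chunk(start, stop, cache))
    return total


x = Source([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x * x + x  # builds the DAG only; no partition has been read yet
total = lazy_sum(y, length=6, chunk_size=2)
```

Here `x` appears three times in the expression, yet `x.chunks_read` ends at 3, one read per partition, because intermediate results are shared within each partition. The two-level partitioning and computation reordering of point (iii) are not modeled in this sketch.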
Year
2018
DOI
10.1145/3178487.3178501
Venue
PPOPP
Keywords
R, machine learning, parallel, solid-state drives
Field
Data point, Memory hierarchy, Spark (mathematics), Computer science, Matrix function, Parallel computing, Fortran, Artificial intelligence, Matrix multiplication, Machine learning, Speedup, Computation
DocType
Conference
Volume
53
Issue
1
ISSN
0362-1340
ISBN
978-1-4503-4982-6
Citations
0
PageRank
0.34
References
24
Authors
5
Name, Order, Citations, PageRank
Da Zheng, 1, 6, 2.49
Disa Mhembere, 2, 63, 5.42
Joshua T. Vogelstein, 3, 273, 31.99
Carey E. Priebe, 4, 505, 108.56
Randal C. Burns, 5, 8, 4.19