FlashR: parallelize and scale R for machine learning using SSDs. - Citegraph

Paper Info

Title
FlashR: parallelize and scale R for machine learning using SSDs.

Abstract
R is one of the most popular programming languages for statistics and machine learning, but it is slow and unable to scale to large datasets. The general approach to obtaining an efficient algorithm in R is to implement it in C or FORTRAN and provide an R wrapper. FlashR accelerates and scales existing R code by parallelizing a large number of matrix functions in the R base package and scaling them beyond memory capacity with solid-state drives (SSDs). FlashR performs memory-hierarchy-aware execution to speed up parallelized R code by (i) evaluating matrix operations lazily, (ii) performing all operations in a DAG in a single execution and with only one pass over the data to increase the ratio of computation to I/O, and (iii) performing two levels of matrix partitioning and reordering computation on matrix partitions to reduce data movement in the memory hierarchy. We evaluate FlashR on various machine learning and statistics algorithms on inputs of up to four billion data points. Despite the huge performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms. The R implementations in FlashR outperform H2O and Spark MLlib by a factor of 3 to 20.
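Points (i) and (ii) of the abstract can be illustrated with a minimal Python sketch. It is not FlashR's API or implementation: the names `Lazy`, `Source`, `Const`, and `col_sums` are hypothetical, and an in-memory NumPy array stands in for a matrix stored on SSD. Operators only build a DAG, and the final aggregation evaluates the whole DAG one row partition at a time, so each partition of the input is read once.

```python
import numpy as np


class Lazy:
    """Node in a DAG of deferred element-wise operations (hypothetical names)."""

    def __init__(self, op, inputs):
        self.op = op          # NumPy function applied to the evaluated inputs
        self.inputs = inputs  # parent nodes in the DAG

    def __add__(self, other):
        return Lazy(np.add, [self, as_lazy(other)])

    def __mul__(self, other):
        return Lazy(np.multiply, [self, as_lazy(other)])

    def eval_chunk(self, rows, cache):
        # Memoize per partition so a node shared by several parents
        # is computed (and its data read) only once per pass.
        key = id(self)
        if key not in cache:
            cache[key] = self._compute(rows, cache)
        return cache[key]

    def _compute(self, rows, cache):
        args = [node.eval_chunk(rows, cache) for node in self.inputs]
        return self.op(*args)


class Source(Lazy):
    """Leaf node; the array is an assumed stand-in for a matrix on SSD."""

    def __init__(self, data):
        self.data = data
        self.reads = 0  # number of partitions read so far

    def _compute(self, rows, cache):
        self.reads += 1
        return self.data[rows]


class Const(Lazy):
    """Leaf node holding a scalar constant."""

    def __init__(self, value):
        self.value = value

    def _compute(self, rows, cache):
        return self.value


def as_lazy(x):
    return x if isinstance(x, Lazy) else Const(x)


def col_sums(node, n_rows, chunk_rows):
    """Evaluate the whole DAG partition by partition and sum the columns."""
    total = None
    for start in range(0, n_rows, chunk_rows):
        rows = slice(start, min(start + chunk_rows, n_rows))
        part = node.eval_chunk(rows, {}).sum(axis=0)
        total = part if total is None else total + part
    return total


data = np.arange(12.0).reshape(6, 2)
x = Source(data)
expr = x * x + 1                                # builds the DAG; nothing is computed yet
result = col_sums(expr, n_rows=6, chunk_rows=2)  # one pass over three partitions
```

Because the multiply, the add, and the column sum are fused per partition, no full-size intermediate matrix is materialized, which is the effect the abstract describes as increasing the ratio of computation to I/O.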

Year	2018
DOI	10.1145/3178487.3178501
Venue	PPOPP
Keywords	R, machine learning, parallel, solid-state drives
Field	Data point, Memory hierarchy, Spark (mathematics), Computer science, Matrix function, Parallel computing, Fortran, Artificial intelligence, Matrix multiplication, Machine learning, Speedup, Computation
DocType	Conference
Volume	53
Issue	1
ISSN	0362-1340
ISBN	978-1-4503-4982-6
Citations	0
PageRank	0.34
References	24
Authors	5

Authors (5 rows)

Name	Order	Citations	PageRank
Da Zheng	1	6	2.49
Disa Mhembere	2	63	5.42
Joshua T. Vogelstein	3	273	31.99
Carey E. Priebe	4	505	108.56
Randal C. Burns	5	8	4.19
