Title
SystemML: Declarative Machine Learning on Spark.
Abstract
The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists as they are able to express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and declarative ML in general. Finally, SystemML is open-source, which allows the database community to leverage it as a testbed for further research.
Year
2016
DOI
10.14778/3007263.3007279
Venue
PVLDB
Field
Linear algebra, Data mining, Spark (mathematics), Programming language, Computer science, End-to-end principle, Testbed, Artificial intelligence, Porting, Machine learning, Database
DocType
Journal
Volume
9
Issue
13
ISSN
2150-8097
Citations
38
PageRank
0.91
References
27
Authors
11