Title
Big Data Analytics with Spark
Abstract
Born from a Berkeley graduate project, the Apache Spark library has grown to be the most broadly used big data analytics platform. While Spark integrates with the older Hadoop ecosystem, it provides far more intuitive, faster, and more powerful abstractions for manipulating distributed data than MapReduce. In this workshop, we will cover the basics of the Spark library, with the goal of getting participants up to speed so that they can use the library or teach it in courses that involve big data or distributed processing. Participants will work with examples that range from calculating basic summary statistics to using the Spark Machine Learning library to perform sophisticated machine learning analyses on large datasets. Tasks during the session will be performed on smaller samples using the Spark local standalone implementation on participants' laptops. We will also discuss how Spark can be run on a local or cloud-based cluster and point participants toward resources for setting up those environments for their students.
Year
2020
DOI
10.1145/3287324.3287551
Venue
Proceedings of the 50th ACM Technical Symposium on Computer Science Education
Keywords
big data, data science, distributed computing, spark
Field
Abstraction, Spark (mathematics), Computer science, Multimedia, Big data, Cloud computing
DocType
Conference
ISBN
978-1-4503-5890-3
Citations
1
PageRank
0.39
References
0
Authors
1
Name
Mark C. Lewis
Order
1
Citations
24
PageRank
5.04