Title
Insights on Apache Spark Usage by Mining Stack Overflow Questions
Abstract
Apache Spark is one of the most popular big data tools. Despite its popularity, there are no studies regarding its overall usage among software developers. As such, essential questions remain unanswered. For instance, it is not known what the common issues faced by Spark users are, what the most popular Spark libraries are, or what technologies are most commonly used together with Spark. In this paper, we mine Stack Overflow questions to shed some light on these issues. Specifically, we first apply Latent Dirichlet Allocation (LDA) to Stack Overflow questions and obtain the main topics of discussion. By computing previously proposed metrics and a novel modification, we provide insights into Spark usage while taking question view count into account. Further insights are then given by applying newly proposed metrics to the question tags. Finally, temporal trends are discussed after analyzing the proposed metrics over time.
Year: 2018
DOI: 10.1109/BigDataCongress.2018.00037
Venue: 2018 IEEE International Congress on Big Data (BigData Congress)
Keywords: Apache Spark, mining software repositories, Stack Overflow, topic modeling
Field: Data science, Latent Dirichlet allocation, Spark (mathematics), Computer science, Popularity, Software, Stack overflow, Big data, Database, Market research
DocType: Conference
ISSN: 2379-7703
ISBN: 978-1-5386-7233-4
Citations: 1
PageRank: 0.35
References: 0
Authors: 3
Name | Order | Citations | PageRank
Leonardo Jiménez Rodríguez | 1 | 1 | 0.35
Xiaoran Wang | 2 | 1 | 0.69
Jilong Kuang | 3 | 38 | 17.00