Paper Info

Title
Insights on Apache Spark Usage by Mining Stack Overflow Questions

Abstract
Apache Spark is one of the most popular big data tools. Despite its popularity, there are no studies of its overall usage among software developers. As a result, essential questions remain unanswered. For instance, it is not known what issues Spark users commonly face, which Spark libraries are most popular, or which technologies are most commonly used together with Spark. In this paper, we mine Stack Overflow questions to shed some light on these issues. Specifically, we first apply Latent Dirichlet Allocation (LDA) to Stack Overflow questions to obtain the main topics of discussion. By computing previously proposed metrics and a novel modification, we provide insights into Spark usage while taking question view count into account. Further insights are then given by applying newly proposed metrics to the question tags. Finally, temporal trends are discussed by analyzing the proposed metrics over time.

Year	2018
DOI	10.1109/BigDataCongress.2018.00037
Venue	2018 IEEE International Congress on Big Data (BigData Congress)
Keywords	Apache Spark, mining software repositories, Stack Overflow, topic modeling
Field	Data science, Latent Dirichlet allocation, Spark (mathematics), Computer science, Popularity, Software, Stack overflow, Big data, Database, Market research
DocType	Conference
ISSN	2379-7703
ISBN	978-1-5386-7233-4
Citations	1
PageRank	0.35
References	0
Authors	3

Authors (3 rows)

Name	Order	Citations	PageRank
Leonardo Jiménez Rodríguez	1	1	0.35
Xiaoran Wang	2	1	0.69
Jilong Kuang	3	38	17.00
