Title
Speeding Up Data Manipulation Tasks with Alternative Implementations: An Exploratory Study
Abstract
As data volume and complexity grow at an unprecedented rate, the performance of data manipulation programs is becoming a major concern for developers. In this article, we study how alternative API choices could improve data manipulation performance while preserving task-specific input/output equivalence. We propose a lightweight approach that leverages the comparative structures in Q&A sites to extract alternative implementations. On a large dataset of Stack Overflow posts, our approach extracts 5,080 pairs of alternative implementations that invoke different data manipulation APIs to solve the same tasks, with an accuracy of 86%. Experiments show that for 15% of the extracted pairs, the faster implementation achieved a speedup of more than 10x over its slower alternative. We also characterize 68 recurring alternative API pairs from the extraction results to understand the types of APIs that can be used alternatively. To put these findings into practice, we implement a tool, AlterApi, to automatically optimize real-world data manipulation programs. In the 1,267 optimization attempts on the Kaggle dataset, 76% achieved desirable performance improvements with up to orders-of-magnitude speedup. Finally, we discuss notable challenges of using alternative APIs for optimizing data manipulation programs. We hope that our study offers a new perspective on API recommendation and automatic performance optimization.
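To make the idea of an alternative implementation pair concrete, here is a minimal sketch in Python. It is not taken from the article: the two functions, the `compare` helper, and the sample data are hypothetical and use only the standard library. It shows two implementations that solve the same task through different APIs, checks that they agree on a given input (task-specific input/output equivalence), and times each one.

```python
import timeit


def dedupe_with_list(values):
    """Remove duplicates, keeping first occurrences, via list membership tests."""
    seen = []
    for v in values:
        if v not in seen:  # linear scan of `seen` for every element
            seen.append(v)
    return seen


def dedupe_with_dict(values):
    """Same task through a different API: dict keys keep insertion order (Python 3.7+)."""
    return list(dict.fromkeys(values))


def compare(impl_a, impl_b, data, repeats=3):
    """Check that both implementations agree on `data`, then time each one.

    Returns the best wall-clock time in seconds for each implementation.
    Agreement is checked only on this one input, not in general.
    """
    if impl_a(data) != impl_b(data):
        raise ValueError("implementations disagree on this input")
    time_a = min(timeit.repeat(lambda: impl_a(data), number=1, repeat=repeats))
    time_b = min(timeit.repeat(lambda: impl_b(data), number=1, repeat=repeats))
    return time_a, time_b


if __name__ == "__main__":
    # Hypothetical workload: 20,000 values with 500 distinct entries.
    sample = [i % 500 for i in range(20_000)]
    list_time, dict_time = compare(dedupe_with_list, dedupe_with_dict, sample)
    print(f"list scan: {list_time:.4f}s, dict.fromkeys: {dict_time:.4f}s")
```

The measured times depend on the machine and the input, so the sketch reports them rather than assuming which variant wins. The equivalence check is deliberately limited to the supplied input, which mirrors the abstract's notion of task-specific rather than general equivalence.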
Year: 2021
DOI: 10.1145/3456873
Venue: ACM Transactions on Software Engineering and Methodology
Keywords: API selection, data manipulation, performance optimization, mining software repository, empirical study
DocType: Journal
Volume: 30
Issue: 4
ISSN: 1049-331X
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name            Order  Citations  PageRank
Yida Tao        1      138        6.29
Shan Tang       2      0          0.34
Yepang Liu      3      415        24.58
Zhiwu Xu        4      58         11.32
Shengchao Qin   5      711        62.81