Title
Using sample-based time series data for automated diagnosis of scalability losses in parallel programs.
Abstract
The performance of many parallel applications has failed to scale as fast as successive generations of hardware on which these applications execute. To understand the cause of scalability losses, experts use performance tools to monitor and analyze application behavior. Profiles generated by performance tools can usually indicate the presence of scalability losses while time series data are generally necessary to pinpoint the root causes of such losses. However, manual analysis of time series data can be difficult in executions with a large number of processes, long running times, and deep call chains. This paper describes an automated framework that analyzes sample-based time series data to diagnose scalability losses in parallel executions. The framework's automated diagnosis of scalability losses indicates their symptoms, severity, and causes. Two case studies illustrate the effectiveness of this framework. When compared to a tool that analyzes performance using instrumentation-based traces, our overhead for collecting sample-based time series is 1/28 in time and 1/1600 in space while our automated analysis takes 1/25 of the time.
Year: 2020
DOI: 10.1145/3332466.3374538
Venue: PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, California, February 2020
Keywords: Performance, automated diagnosis, scalability losses, sample-based time series data
Field: Time series, Computer science, Distributed computing, Scalability
DocType: Conference
ISBN: 978-1-4503-6818-6
Citations: 0
PageRank: 0.34
References: 0
Authors: 2
Name                  Order  Citations  PageRank
Lai Wei               1      8          4.98
John Mellor-Crummey   2      868        76.69