Copyright Notice:
The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Publications of SPCL
Y. Jin, H. Wang, X. Tang, Z. Guo, Y. Zhao, T. Hoefler: | ||
Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications (IEEE Transactions on Parallel and Distributed Systems. Vol 36, Nr. 2, In , pages 308-325, ISSN: , 2025) Publisher Reference AbstractIt is challenging to scale parallel applications to modern supercomputers because of load imbalance, resource contention, and communications between processes. Profiling and tracing are two main performance analysis approaches for detecting these scalability bottlenecks. Profiling is low-cost but lacks detailed dependence for identifying root causes. Tracing records plentiful information but incurs significant overheads. To address these issues, we present ScalAna, which employs static analysis techniques to combine the benefits of profiling and tracing - it enables tracing's analyzability with overhead similar to profiling. ScalAna uses static analysis to capture program structures and data dependence of parallel applications, and leverages lightweight profiling approaches to record performance data during runtime. Then a parallel performance graph is generated with both static and dynamic data. Based on this graph, we design a backtracking detection approach to automatically pinpoint the root causes of scaling issues. We evaluate the efficacy and efficiency of ScalAna using several real applications with up to 704K lines of code and demonstrate that our approach can effectively pinpoint the root causes of scaling loss with an average overhead of 5.65% for up to 16,384 processes. By fixing the root causes detected by our tool, it achieves up to 33.01% performance improvement.Documents | ||
BibTeX | ||
|