Copyright Notice:
The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Publications of SPCL
S. Shen, L. Huang, M. Chrapek, T. Schneider, J. Dayal, M. Gajbe, R. Wisniewski, T. Hoefler:
LLAMP: Assessing Network Latency Tolerance of HPC Applications with Linear Programming
(In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24), presented in Atlanta, GA, USA, pages 1004-1021, IEEE Press, ISBN: 979-8-3503-5291-7, Nov. 2024)
SC'24 Best Paper Award (1/99)

Abstract: The shift towards high-bandwidth networks driven by AI workloads in data centers and HPC clusters has unintentionally aggravated network latency, adversely affecting the performance of communication-intensive HPC applications. As large-scale MPI applications often exhibit significant differences in their network latency tolerance, it is crucial to accurately determine the extent of network latency an application can withstand without significant performance degradation. Current approaches to assessing this metric often rely on specialized hardware or network simulators, which can be inflexible and time-consuming. In response, we introduce LLAMP, a novel toolchain that offers an efficient, analytical approach to evaluating HPC applications' network latency tolerance using the LogGPS model and linear programming. LLAMP equips software developers and network architects with essential insights for optimizing HPC infrastructures and strategically deploying applications to minimize latency impacts. Through our validation on a variety of MPI applications like MILC, LULESH, and LAMMPS, we demonstrate our tool's high accuracy, with relative prediction errors generally below 2%. Additionally, we include a case study of the ICON weather and climate model to illustrate LLAMP's broad applicability in evaluating collective algorithms and network topologies.