SPCL - Publications

Conference Rankings

SPONSORS

COLLABORATIONS

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Publications of SPCL

G. Kwasniewski, M. Kabić, T. Ben-Nun, A. Nikolaos Ziogas, J. Eirik Saethre, A. Gaillard, T. Schneider, M. Besta, A. Kozhevnikov, J. VandeVondele, T. Hoefler:

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations

(In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC21), Nov. 2021)

Publisher Reference

Abstract

Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating N^3/(P*sqrt(M)) elements per processor, where M is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested scenarios, with matrix sizes ranging from 2,048 to 262,144 on up to 512 CPU nodes of the Piz Daint supercomputer, decreasing the time-to-solution by up to three times. Our code is ScaLAPACK-compatible and available as an open-source library.

Documents

download article:

access preprint on arxiv:

Recorded talk (best effort)

BibTeX

@inproceedings{,
  author={Grzegorz Kwasniewski and Marko Kabić and Tal Ben-Nun and Alexandros Nikolaos Ziogas and Jens Eirik Saethre and André Gaillard and Timo Schneider and Maciej Besta and Anton Kozhevnikov and Joost VandeVondele and Torsten Hoefler},
  title={{On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations}},
  year={2021},
  month={11},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC21)},
  doi={10.1145/3458817.3476167},
}