Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Publications of SPCL

P. Okanovic, G. Kwasniewski, P. Sylos Labini, M. Besta, F. Vella, T. Hoefler:

 High Performance Unstructured SpMM Computation Using Tensor Cores

(In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24), presented in Atlanta, GA, USA, pages 154:1-154:14, IEEE Press, ISBN: 979-8-3503-5291-7, Nov. 2024)

Publisher Reference

Abstract

High-performance sparse matrix-matrix (SpMM) multiplication is paramount for science and industry, as the ever-increasing sizes of data prohibit using dense data structures. Yet, existing hardware, such as Tensor Cores (TC), is ill-suited for SpMM, as it imposes strict constraints on data structures that cannot be met by the unstructured sparsity found in many applications. To address this, we introduce (S)parse (Ma)trix Matrix (T)ensor Core-accelerated (SMaT): a novel SpMM library that utilizes TCs for unstructured sparse matrices. Our block-sparse library leverages the low-level CUDA MMA (matrix multiply-accumulate) API, maximizing the performance offered by modern GPUs. Algorithmic optimizations, such as sparse matrix permutation, further improve performance by minimizing the number of non-zero blocks. The evaluation on an NVIDIA A100 GPU shows that SMaT outperforms state-of-the-art libraries (DASP, cuSPARSE, and Magicube) by up to 125x (2.6x on average). SMaT can be used to accelerate many workloads in scientific computing, large-model training, inference, and others.
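The following is a minimal CUDA sketch of the block-sparse Tensor Core idea summarized in the abstract: each warp accumulates one 16x16 output tile while visiting only the non-zero 16x16 blocks of the sparse matrix. For brevity it uses the higher-level nvcuda::wmma API, whereas SMaT itself builds on the lower-level PTX mma instructions; the data-structure names (blockRowPtr, blockColIdx, blockVals) are illustrative placeholders and do not correspond to SMaT's actual implementation.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Illustrative sketch (not SMaT's code): one warp per thread block computes one
// 16x16 tile of C = A * B, where A is stored in a BSR-like block-sparse layout
// and only its non-zero 16x16 blocks are visited.
__global__ void block_sparse_mma(const int*  blockRowPtr,   // CSR-style offsets over block rows of A
                                 const int*  blockColIdx,   // block-column index of each non-zero block
                                 const half* blockVals,     // non-zero 16x16 blocks of A, row-major
                                 const half* B, int ldb,    // dense B, row-major
                                 float*      C, int ldc)    // dense C, row-major
{
    const int TILE  = 16;
    const int tileM = blockIdx.y;   // block row of A / row tile of C
    const int tileN = blockIdx.x;   // column tile of B and C

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::fill_fragment(cFrag, 0.0f);

    // Skip zero blocks entirely: only non-zero blocks of A contribute to this tile.
    for (int b = blockRowPtr[tileM]; b < blockRowPtr[tileM + 1]; ++b) {
        const int tileK = blockColIdx[b];
        wmma::load_matrix_sync(aFrag, blockVals + (size_t)b * TILE * TILE, TILE);
        wmma::load_matrix_sync(bFrag, B + (size_t)tileK * TILE * ldb + tileN * TILE, ldb);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);   // Tensor Core multiply-accumulate
    }
    wmma::store_matrix_sync(C + (size_t)tileM * TILE * ldc + tileN * TILE, cFrag, ldc,
                            wmma::mem_row_major);
}

// Example launch: one warp (32 threads) per output tile.
//   dim3 grid(nColTiles, nBlockRowsOfA);
//   block_sparse_mma<<<grid, 32>>>(rowPtr, colIdx, vals, B, N, C, N);

The loop bound blockRowPtr[tileM + 1] - blockRowPtr[tileM] is exactly the number of non-zero blocks in that block row, which is why the sparse-matrix permutation mentioned in the abstract (reducing the number of non-zero blocks) directly reduces the work per tile.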

Documents

download article:
access preprint on arxiv:

BibTeX

@inproceedings{okanovic2024high,
  author={Patrik Okanovic and Grzegorz Kwasniewski and Paolo Sylos Labini and Maciej Besta and Flavio Vella and Torsten Hoefler},
  title={{High Performance Unstructured SpMM Computation Using Tensor Cores}},
  year={2024},
  month={11},
  pages={154:1-154:14},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24)},
  location={Atlanta, GA, USA},
  publisher={IEEE Press},
  isbn={979-8-3503-5291-7},
  doi={10.1109/SC41406.2024.00060},
}