The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.
Publications of SPCL
|P. Scheffler, F. Zaruba, F. Schuiki, T. Hoefler, L. Benini:|
|Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra|
(In Proceedings of Design, Automation, and Test in Europe (DATE), 2021)
AbstractSparse-dense linear algebra is crucial in many domains, but challenging to handle efficiently on CPUs, GPUs, and accelerators alike; multiplications with sparse formats like CSR and CSF require indirect memory lookups. In this work, we enhance a memory-streaming RISC-V ISA extension to accelerate sparse-dense products through streaming indirection. We present efficient dot, matrix-vector, and matrix-matrix product kernels using our hardware, enabling single-core FPU utilizations of up to 80% and speedups of up to 7.2x over an optimized baseline without extensions. A matrix-vector implementation on a multi-core cluster is up to 5.8x faster and 2.7x more energy-efficient with our kernels than an optimized baseline. We propose further uses for our indirection hardware, such as scatter-gather operations and codebook decoding, and compare our work to state-of-the-art CPU, GPU, and accelerator approaches, measuring a 2.8x higher peak FP64 utilization in CSR matrix-vector multiplication than a GTX 1080 Ti GPU running a cuSPARSE kernel.
access preprint on arxiv:
Recorded talk (best effort)