Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Publications of SPCL

S. Ashkboos, M. L. Croci, M. Gennari do Nascimento, T. Hoefler, J. Hensman:

SliceGPT: Compress Large Language Models by Deleting Rows and Columns

(In The Twelfth International Conference on Learning Representations, May 2024)

Abstract

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% of the dense model's zero-shot task performance, respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT, and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models.
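
To make the slicing idea concrete, below is a minimal NumPy sketch (not the authors' implementation; the variable names and the random orthogonal matrix Q are placeholders, whereas the paper derives the orthogonal transformation from calibration data using the computational-invariance property). It shows only the shape-level effect: the weight matrix is rotated into a new basis and trailing rows are deleted, leaving a smaller dense matrix.

    import numpy as np

    # Illustrative transformer dimensions (hypothetical, GPT-2-like).
    d_model, d_ff = 768, 3072
    keep = int(0.75 * d_model)   # retain 75% of the embedding dimension

    rng = np.random.default_rng(0)
    W_in = rng.standard_normal((d_model, d_ff))          # stand-in for an MLP input projection
    Q, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))  # placeholder orthogonal basis

    # Rotate into the new basis, then delete the trailing rows: the result is a
    # smaller *dense* matrix, so no sparse kernels or extra index structures are needed.
    W_sliced = (Q.T @ W_in)[:keep, :]
    print(W_in.shape, "->", W_sliced.shape)   # (768, 3072) -> (576, 3072)

Because the reduced matrix stays dense, the compute and memory savings claimed in the abstract come from standard dense kernels, without any additional code optimization.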

Documents

download article:
access preprint on arxiv: https://arxiv.org/abs/2401.15024

BibTeX

@inproceedings{ashkboos2024slicegpt,
  author={Saleh Ashkboos and Maximilian L. Croci and Marcelo Gennari do Nascimento and Torsten Hoefler and James Hensman},
  title={{SliceGPT: Compress Large Language Models by Deleting Rows and Columns}},
  year={2024},
  month={05},
  booktitle={The Twelfth International Conference on Learning Representations},
  doi={10.48550/arXiv.2401.15024},
}