
Language Modeling Is Compression (2023, arXiv)

Published Sept. 19, 2023 on arXiv.

View on arXiv

4 stars (1 review)

It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to …
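The headline numbers above fall out of a simple equivalence: an arithmetic coder driven by a predictive model spends about -log2 p(x_t | x_<t) bits per symbol, so a model's cumulative log-loss is its compressed size. Below is a minimal sketch of that accounting, assuming a toy byte-level bigram model stands in for the LLM; the model, sample text, and add-one smoothing are illustrative choices, not the paper's setup.

```python
# Sketch of the prediction-compression equivalence: under arithmetic coding a
# sequence costs about sum_t -log2 p(x_t | x_<t) bits, so a better predictor
# directly means a smaller compressed size. The bigram "model" is a toy stand-in.
import math
from collections import defaultdict

def bigram_model(data: bytes):
    """Fit byte-level bigram counts; return p(next | prev) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(data, data[1:]):
        counts[prev][nxt] += 1
    def prob(prev: int, nxt: int) -> float:
        total = sum(counts[prev].values()) + 256  # smooth over all 256 byte values
        return (counts[prev][nxt] + 1) / total
    return prob

def compressed_bits(data: bytes, prob) -> float:
    """Ideal arithmetic-coding cost in bits: cumulative -log2 of the model's predictions."""
    bits = 8.0  # first byte sent uncompressed
    for prev, nxt in zip(data, data[1:]):
        bits += -math.log2(prob(prev, nxt))
    return bits

text = b"language modeling is compression " * 50
prob = bigram_model(text)
# Fitting and coding on the same text overstates the ratio; this only illustrates the accounting.
ratio = compressed_bits(text, prob) / (8 * len(text))
print(f"implied compression ratio: {ratio:.1%} of raw size")
```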

1 edition

Really nice way to formalize a collective intuition

4 stars

This paper formally equates (lossless) compression with prediction/learning. While the Hutter Prize has long postulated the connection, this paper shows how an LLM can act as a better compressor for multimodal data than today's domain-specific standards. The authors also use the gzip compression algorithm as a generative model, with rather poor results, but they lay out a mathematical framework to build on.
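As a concrete reading of that gzip experiment, here is a hedged sketch of the underlying trick: score candidate continuations by the extra code length a general-purpose compressor needs for them given the context. The context and candidate strings are made up for illustration, and the paper's construction turns code lengths into a proper conditional distribution rather than this greedy argmin.

```python
# Sketch of "compressor as conditional generative model": rank candidate
# continuations by how many extra bits gzip needs to encode them after the context.
import gzip

def extra_bits(context: bytes, candidate: bytes) -> int:
    """Approximate conditional code length of `candidate` given `context`."""
    return 8 * (len(gzip.compress(context + candidate)) - len(gzip.compress(context)))

context = b"the quick brown fox jumps over the lazy dog. the quick brown fox "
candidates = [b"jumps", b"sleeps", b"quantum"]
scores = {c: extra_bits(context, c) for c in candidates}
best = min(scores, key=scores.get)  # the continuation gzip finds most compressible given the context
print(scores, "->", best)
```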

The paper also covers tokenization as compression (sketched below), something that has been missing from much of the other scientific discourse on this subject. Overall a nice read; 4* only because it ends abruptly without fully exploring the space of compressors as generative models.
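To make the tokenization-as-compression point concrete: a tokenizer trades a larger vocabulary for a shorter sequence, so even a flat fixed-length code per token can undercut 8 bits per raw byte. The toy vocabulary and greedy longest-match tokenizer below are illustrative stand-ins, not the paper's tokenizer.

```python
# Toy illustration of tokenization as compression: fewer tokens over a larger
# vocabulary cost fewer total bits than the raw 8-bits-per-character encoding.
import math

vocab = ["compression", "language", "model", "ing", "is", " ", "a", "e", "i", "o",
         "u", "t", "n", "s", "r", "l", "m", "c", "p", "g", "d"]
vocab.sort(key=len, reverse=True)  # greedy longest match first

def tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        for piece in vocab:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

text = "language modeling is compression"
tokens = tokenize(text)
bits_per_token = math.ceil(math.log2(len(vocab) + 256))  # room for byte fallbacks
print(len(text) * 8, "raw bits vs", len(tokens) * bits_per_token, "tokenized bits")
```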