The idea is that if you can produce an accurate probability distribution over the next bit/byte/token, then you can compress things with an entropy coder (Huffman coding, range coding, asymmetric numeral systems, etc.). This comment is too small a space to explain fully how they work, but suffice it to say that pretty much every good compression algorithm models probability distributions in some way.
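(As a toy illustration, not anything efficient: an exact arithmetic coder over a fixed PMF, using Python fractions to dodge the integer renormalization that real range coders and ANS implementations do. The model and input string are made up for the example.)

from fractions import Fraction
from math import ceil, log2

def encode(symbols, pmf):
    # Narrow the interval [low, low + width) by each symbol's probability slice.
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        cum = Fraction(0)
        for sym, p in pmf.items():
            if sym == s:
                break
            cum += p
        low += width * cum
        width *= pmf[s]
    # Any number inside the final interval identifies the whole sequence;
    # writing one down takes about -log2(width) bits (the Shannon information).
    return low, ceil(-log2(width)) + 1

pmf = {"a": Fraction(7, 10), "b": Fraction(2, 10), "c": Fraction(1, 10)}
_, nbits = encode("aababaa", pmf)
print(nbits, "bits for 7 symbols")  # 9 bits here, vs. 56 bits as ASCII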
But how can you get credible probability distributions from the LLMs? My understanding is that the outputs specifically can't be interpreted as a probability distribution, even though they superficially resemble a PMF, because the softmax function tends to put close to 100% on the predicted token. You can still get an ordered list of the most probable tokens (which I think beam search exploits), but the values themselves aren't good representations of the output distribution, since they don't model the variance well.
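(For what it's worth, you can read the full post-softmax distribution off directly; it's mechanically a PMF even if its calibration is debatable. A sketch with HuggingFace transformers, with GPT-2 standing in for whatever model:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # example model only
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("It was a dark and stormy", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids=ids).logits[0, -1]  # scores for the next token
probs = torch.softmax(logits, dim=-1)  # sums to 1 over the whole vocab

top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{p.item():.3f} {tok.decode([int(i)])!r}")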
$ curl https://www.gutenberg.org/cache/epub/11/pg11.txt > text.txt
$ split -n 500 text.txt trainpart.
Using a normal compression algorithm:
$ zstd --train trainpart.* -o dictionary
Save dictionary of size 112640 into file dictionary
$ zstd -vD dictionary text.txt
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
text.txt : 15.41% ( 170 KiB => 26.2 KiB, text.txt.zst)
For this example, ZSTD warns that the dictionary training set is 10X-100X too small to be efficient. Realistically, I guess you'd train it over, e.g., the entire Project Gutenberg library. Then you can distribute specific books to people who already have the dictionary. Or:
$ curl -L https://archive.org/download/completeworksofl1920carr/completeworksofl1920carr_hocr_searchtext.txt.gz |
gzip -d |
sed -E 's/\s+/ /g' > FullTextsSample.txt
$ zstd -v -19 --patch-from=FullTextsSample.txt text.txt
text.txt : 16.50% ( 170 KiB => 28.1 KiB, text.txt.zst)
Not sure how much performance would drop for realistic use. But there are also some knobs you can tune. Refer to:
https://github.com/facebook/zstd/#dictionary-compression-how...
https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...
$ man zstd
- Dictionary occupies only kilobytes or megabytes of storage, instead of gigabytes or terabytes.
- Dictionary can be re-trained for specific data at negligible cost.
- Compression and decompression are deterministic by default.
- Doesn't take a large amount of GPU resources to compress/decompress.
- It's actually designed to do this.
(To your point, one of those measures isn't including gigabytes of LLM in its size savings, as if it's part of the .exe size instead.)
* EDIT to link to discussion further down: https://news.ycombinator.com/item?id=40245530
Yeah. But I don't think it's hinting at any fundamental theoretical limit.
Both the LLM and my examples were trained on data including the full text of Alice in Wonderland, which we're "compressing". Probably many copies of it, for the LLM. In theory they should both be able to reach 0% (or very close).
So both the blog post and my examples are a bit silly. Like "losslessly compressing" an image by diffing it with a lossy JPEG, then claiming a higher compression ratio than PNG/JPEG XL because the compression program is a 1TB binary that bundles Sloot-style blurry copies of every known image.
In fact, by just adding `--maxdict=1MB` to my first example, `zstd -D` gets down to 13.5%. Probably lower with further tweaking. And adding an explicit `cat text.txt >> FullTextsSample.txt` brings `zstd --patch-from` down to… Uh. 0.02%. 40 bytes total. …And probably around a third of that is headers and checksum… So… Yeah. A bit silly.
I think a better comparison should probably:
- Have a clean separation between the training data and the data to be compressed. Usually the data to be compressed should be similar to, but not included in, the training data.
- Use the same training data for both the LLM and conventional compressor.
- Include the dictionary/model size. And compare methods at the same dictionary/model size.
Also, as an aside, the method in the blog post could probably also get smaller by storing token-probability ranks in place of most of its current explicit letters.
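(i.e., roughly this, as a hypothetical helper; the resulting stream of ranks, which skews heavily toward 0 for a good model, is what you'd then hand to the entropy coder:)

import torch

def token_rank(probs: torch.Tensor, actual_id: int) -> int:
    # Where the actually-observed token falls in the model's sorted predictions.
    order = torch.argsort(probs, descending=True)
    return int((order == actual_id).nonzero())

def token_from_rank(probs: torch.Tensor, rank: int) -> int:
    # The inverse: the same deterministic model plus the rank recovers the token.
    return int(torch.argsort(probs, descending=True)[rank])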
Interestingly (though maybe not relevant here), you can also get different results from multi-user inference systems depending on what other requests are in the batch. It's possible to avoid this, but I'm pretty sure most systems don't.
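(Usually because the reduction order inside the kernels changes with the batch shape, and floating-point addition isn't associative. A toy standalone illustration, minus any actual LLM:)

import torch

a = torch.randn(1 << 20)
# Same numbers, different summation order: the results usually differ
# in the low bits, and logits inherit this kind of noise.
print((a.sum() - a.flip(0).sum()).item())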
The "slightly different" bit of course makes it worse - it will work 99% of the time.
People have found other ways to do that of course, but this is pretty clever.
Imagine your corpus of training data contains the following:
- bloga.com: "I read in the NYT that 'it rains cats and dogs twice per year'"
- blogb.com: "according to the NYT, 'cats and dogs level rain occurs 2 times per year.'"
- newssite.com: "cats and dogs rain events happen twice per year, according to the New York Times"
Now, you chat with an LLM trained on this data, asking it "how many times per year does it rain cats and dogs?"
"According to the New York Times, it rains cats and dogs twice per year."
NYT content was never in the training data; however, it -is- mentioned a lot across various CommonCrawl-approved sources, and therefore gets a higher probability association with the next token.
Zoom that out to full articles quoted throughout the web, and you get false positives.
If Stack Overflow collects a bunch of questions and comments and exposes them as a big dataset licensed as Creative Commons, but it actually contains quite a bit of copyrighted content, whose responsibility is it to validate copyright violations in that data? If I use something licensed as CC in good faith, and it turns out the provider or seller of that content had no right to relicense it, am I culpable? Is this just a new lawsuit where I can seek damages for the lawsuit I just lost?
> I think Colour is what the designers of Monolith are trying to challenge, although I'm afraid I think their understanding of the issues is superficial on both the legal and computer-science sides. The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation. You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with. Oh, happy day! The lawyers will just have to all go away now, because we've demonstrated the absurdity of intellectual property!
> The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen. When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour? It's just random bits! Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. It matters where the bits came from.
What I'm getting at is that it's plausible that an LLM is trained purely on things that were available and licensed as Creative Commons, but that the data within contains copyrighted content because someone who contributed to it lied about their ownership rights to provide that content under a Creative Commons license; i.e., StackOverflow user UnicornWitness24 is the perpetrator of the copyright violation, by copying a NYT article into a reply to bypass a paywall for other users, and has now poisoned a dataset. And I'm asking: what is the civil liability for copyright violations if the defendant was the one who was actually defrauded or deceived, and was acting in good faith and within the bounds of the law at the time?
it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports.
But yes, open to interpretation as far as where LLM training falls.
174355 pg11.txt
60907 pg11.txt.gz-9
58590 pg11.txt.zstd-9
54164 pg11.txt.xz-9
25360 [from blog post]
You can convert any predictor into a lossless compressor by feeding the output probabilities into an entropy coding algorithm. LLMs can get compression ratios as high as 95% (0.4 bits per character) on English text.
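(That figure is just the model's cross-entropy on the text: an ideal entropy coder spends -log2(p) bits per symbol. The 0.76 below is a made-up average, picked only because it lands at about 0.4 bits/char:)

from math import log2

def ideal_bits(probs_of_actual):
    # probs_of_actual[i] = probability the model assigned to the symbol
    # that actually occurred at position i.
    return sum(-log2(p) for p in probs_of_actual)

print(ideal_bits([0.76] * 1000) / 1000)  # ~0.40 bits/char, ~95% vs. 8-bit text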
There's no sense, for example, in which deriving a prediction about the nature of reality from a novel scientific theory is 'compression'.
E.g., suppose we didn't know a planet existed, and we looked at orbital data. There's no sense in which compressing that data would indicate another planet existed.
It's a great source of confusion that people think AI/ML systems are 'predicting' novel distributions of observations (science), vs., novel observations of the same distribution (statistics).
It should be more obvious that the latter is just compression, since it's just taking a known distribution of data and replacing it with a derivative optimal value.
Science predicts novel distributions based on theories, ie., it says the world is other than we previously supposed.
Predictions of novel objects derived from scientific theories aren't quantitative data points.
A statistical model of orbits, without a theory of gravity, is less compressed when you assume more objects. Take all the apparent positions of objects in the sky, {(object, x1, x2, t),...}. Find a statistical model of each point at t+1, so y = (o, x1, x2, t+1). There is no sense in which you're deriving a new object in the sky from this statistical model -- it is only a compression of observable orbits.
When you say, "if you have the new planet", you're changing the data generating process (theory) to produce a new distribution of points {(o', x1', x2', t'), ...} that includes an unseen object. You're then comparing two data generating models (two theories) for their simplicity. You're not comparing the associative models.
Call the prior theory 8-planets, so 8P generates (x1, x2, t); and the new theory 9P, which generates (x1', x2', t').
You're then making a conditional error distribution when comparing two rival theories. The 9P theory will minimize this error.
But in no sense can the 9P theory be derived from the initial associative statistical distribution. You are, based on theory (science, knowledge, etc.), choosing to add a planet, vs., e.g., correcting for measurement error, modifying Newton's laws, changing the angle of the earth w.r.t. the solar system... or one of an infinite number of theories which all produce the same error minimization.
The sense of "prediction" that science uses (via Popper et al.) is deriving the existence of novel phenomena that do not follow from prior observable distributions.
You want a statistical model that produces a theory of gravity.
By the way, if you want to see how well gzip actually models language, take any gzipped file, flip a few bits, and unzip it. If it gives you a checksum error, ignore that. You might have to unzip in a streaming way so that it can't tell the checksum is wrong until it's already printed the wrong data that you want to see.
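(One way to try it, sketched in Python; zlib is fed the stream incrementally, so the corrupted output appears before the CRC at the end of the file is ever checked:)

import zlib

data = bytearray(open("pg11.txt.gz", "rb").read())  # any gzipped text file
data[len(data) // 2] ^= 0x10                        # flip one bit mid-stream

d = zlib.decompressobj(wbits=31)                    # 31 = expect gzip framing
out = bytearray()
try:
    for i in range(0, len(data), 4096):
        out += d.decompress(bytes(data[i : i + 4096]))
except zlib.error:
    pass  # the DEFLATE stream often dies outright shortly after the flip
print(out[-500:].decode("latin-1"))                 # garbage near the damage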
$ curl https://www.gutenberg.org/cache/epub/11/pg11.txt | bzip2 --best | wc
246 1183 48925
Also, ZSTD goes all the way up to `--ultra -22` plus `--long=31` (4GB window; irrelevant here since the file fits in the default 8MB anyway).
llama.cpp also has a link to the model
But sure, it's a constant factor, so if you compress enough data you can always ignore it.
But I do want to point out that almost everyone installs at least one multigigabyte file to decompress other files, and that is the OS.
This is just the gigascale version of that.
It’s easier to transmit a story than the entire sensory experience of something happening, so it saves us time and attention. Being able to convert complex tasks and ideas into short stories has likely served us well since the brain burns a lot of calories.
We need compression because internally cognition has a very "wide" set of inputs/outputs (basically meshed), but we only have a serial audio link with relatively narrow bandwidth, so the whole thing that allows us to even discuss this is basically a neat evolutionary hack.