The idea is that if you can produce an accurate probability distribution over the next bit/byte/token, then you can compress things with an entropy coder (Huffman coding, range coding, asymmetric numeral systems, etc.). This comment is too small a space to explain fully how they work, but suffice it to say that pretty much every good compression algorithm models probability distributions in some way.
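(As a toy illustration, not anything efficient: an exact arithmetic coder over a fixed PMF, using Python fractions to dodge the integer renormalization that real range coders and ANS implementations do. The model and input string are made up for the example.)

from fractions import Fraction
from math import ceil, log2

def encode(symbols, pmf):
    # Narrow the interval [low, low + width) by each symbol's probability slice.
    low, width = Fraction(0), Fraction(1)
    for s in symbols:
        cum = Fraction(0)
        for sym, p in pmf.items():
            if sym == s:
                break
            cum += p
        low += width * cum
        width *= pmf[s]
    # Any number inside the final interval identifies the whole sequence;
    # writing one down takes about -log2(width) bits (the Shannon information).
    return low, ceil(-log2(width)) + 1

pmf = {"a": Fraction(7, 10), "b": Fraction(2, 10), "c": Fraction(1, 10)}
_, nbits = encode("aababaa", pmf)
print(nbits, "bits for 7 symbols")  # 9 bits here, vs. 56 bits as ASCII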
But how can you get credible probability distributions from the LLMs? My understanding is that the outputs specifically can't be interpreted as a probability distribution, even though they superficially resemble a PMF, because the softmax function tends to put close to 100% on the predicted token. You can still get an ordered list of the most probable tokens (which I think beam search exploits), but the values themselves aren't good representations of the output distribution, since they don't model the variance well.
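(For what it's worth, you can read the full post-softmax distribution off directly; it's mechanically a PMF even if its calibration is debatable. A sketch with HuggingFace transformers, with GPT-2 standing in for whatever model:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # example model only
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("It was a dark and stormy", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids=ids).logits[0, -1]  # scores for the next token
probs = torch.softmax(logits, dim=-1)  # sums to 1 over the whole vocab

top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{p.item():.3f} {tok.decode([int(i)])!r}")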
$ curl https://www.gutenberg.org/cache/epub/11/pg11.txt > text.txt
$ split -n 500 text.txt trainpart.
Using a normal compression algorithm:
$ zstd --train trainpart.* -o dictionary
Save dictionary of size 112640 into file dictionary
$ zstd -vD dictionary text.txt
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
text.txt : 15.41% ( 170 KiB => 26.2 KiB, text.txt.zst)
For this example, ZSTD warns that the dictionary training set is 10X-100X too small to be efficient. Realistically, I guess you'd train it over, e.g., the entire Project Gutenberg library. Then you can distribute specific books to people who already have the dictionary. Or:
$ curl -L https://archive.org/download/completeworksofl1920carr/completeworksofl1920carr_hocr_searchtext.txt.gz |
gzip -d |
sed -E 's/\s+/ /g' > FullTextsSample.txt
$ zstd -v -19 --patch-from=FullTextsSample.txt text.txt
text.txt : 16.50% ( 170 KiB => 28.1 KiB, text.txt.zst)
Not sure how much performance would drop for realistic use. But there are also some knobs you can tune. Refer to:
https://github.com/facebook/zstd/#dictionary-compression-how...
https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...
$ man zstd
- Dictionary occupies only kilobytes or megabytes of storage, instead of gigabytes or terabytes.
- Dictionary can be re-trained for specific data at negligible cost.
- Compression and decompression are deterministic by default.
- Doesn't take a large amount of GPU resources to compress/decompress.
- It's actually designed to do this.
(To your point, one of those measures isn't including gigabytes of LLM in its size savings, as if it's part of the .exe size instead.)
* EDIT to link to discussion further down: https://news.ycombinator.com/item?id=40245530
Yeah. But I don't think it's hinting at any fundamental theoretical limit.
Both the LLM and my examples were trained on data including the full text of Alice in Wonderland, which we're "compressing". Probably many copies of it, for the LLM. In theory they should both be able to reach 0% (or very close).
So both the blog post and my examples are a bit silly. Like "losslessly compressing" an image by diffing it with a lossy JPEG, then claiming a higher compression ratio than PNG/JPEG XL because the compression program is a 1TB binary that bundles Sloot-style blurry copies of every known image.
In fact, by just adding `--maxdict=1MB` to my first example, `zstd -D` gets down to 13.5%. Probably lower with further tweaking. And adding an explicit `cat text.txt >> FullTextsSample.txt` brings `zstd --patch-from` down to… Uh. 0.02%. 40 bytes total. …And probably around a third of that is headers and checksum… So… Yeah. A bit silly.
I think a better comparison should probably:
- Have a clean separation between the training data and the data to be compressed. Usually the data to be compressed should be similar to, but not included in, the training data.
- Use the same training data for both the LLM and conventional compressor.
- Include the dictionary/model size. And compare methods at the same dictionary/model size.
Also, as an aside, the method in the blog post could probably also get smaller by storing token-probability ranks in place of most of its current explicit letters.
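(i.e., roughly this, as a hypothetical helper; the resulting stream of ranks, which skews heavily toward 0 for a good model, is what you'd then hand to the entropy coder:)

import torch

def token_rank(probs: torch.Tensor, actual_id: int) -> int:
    # Where the actually-observed token falls in the model's sorted predictions.
    order = torch.argsort(probs, descending=True)
    return int((order == actual_id).nonzero())

def token_from_rank(probs: torch.Tensor, rank: int) -> int:
    # The inverse: the same deterministic model plus the rank recovers the token.
    return int(torch.argsort(probs, descending=True)[rank])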
Interestingly (though maybe not relevant here), you can also get different results from multi-user inference systems depending on what other requests are in the batch. It's possible to avoid this, but I'm pretty sure most systems don't.
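(Usually because the reduction order inside the kernels changes with the batch shape, and floating-point addition isn't associative. A toy standalone illustration, minus any actual LLM:)

import torch

a = torch.randn(1 << 20)
# Same numbers, different summation order: the results usually differ
# in the low bits, and logits inherit this kind of noise.
print((a.sum() - a.flip(0).sum()).item())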
The "slightly different" bit of course makes it worse - it will work 99% of the time.
People have found other ways to do that of course, but this is pretty clever.
Imagine your corpus of training data contains the following:
- bloga.com: "I read in the NYT that 'it rains cats and dogs twice per year'"
- blogb.com: "according to the NYT, 'cats and dogs level rain occurs 2 times per year.'"
- newssite.com: "cats and dogs rain events happen twice per year, according to the New York Times"
Now, you chat with an LLM trained on this data, asking it "how many times per year does it rain cats and dogs?"
"According to the New York Times, it rains cats and dogs twice per year."
NYT content was never in the training data; however, it -is- mentioned a lot across various CommonCrawl-approved sources, and therefore gets a higher probability association with the next token.
Zoom that out to full articles quoted throughout the web, and you get false positives.
If Stack Overflow collects a bunch of questions and comments and exposes them as a big dataset licensed as Creative Commons, but it actually contains quite a bit of copyrighted content, whose responsibility is it to validate copyright violations in that data? If I use something licensed as CC in good faith, and it turns out the provider or seller of that content had no right to relicense it, am I culpable? Is this just a new lawsuit where I can seek damages for the lawsuit I just lost?
> I think Colour is what the designers of Monolith are trying to challenge, although I'm afraid I think their understanding of the issues is superficial on both the legal and computer-science sides. The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation. You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with. Oh, happy day! The lawyers will just have to all go away now, because we've demonstrated the absurdity of intellectual property!
> The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen. When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour? It's just random bits! Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. It matters where the bits came from.
What I'm getting at is that it's plausible that an LLM is trained purely on things that were available and licensed as Creative Commons, but that the data within contains copyrighted content because someone who contributed to it lied about their ownership rights to provide that content under a Creative Commons license; i.e., StackOverflow user UnicornWitness24 is the perpetrator of the copyright violation, by copying a NYT article into a reply to bypass a paywall for other users, and has now poisoned a dataset. And I'm asking: what is the civil liability for copyright violations if the defendant was the one who was actually defrauded or deceived, and was acting in good faith and within the bounds of the law at the time?
it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports.
But yes, open to interpretation as far as where LLM training falls.
174355 pg11.txt
60907 pg11.txt.gz-9
58590 pg11.txt.zstd-9
54164 pg11.txt.xz-9
25360 [from blog post]
You can convert any predictor into a lossless compressor by feeding the output probabilities into an entropy coding algorithm. LLMs can get compression ratios as high as 95% (0.4 bits per character) on English text.
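(That figure is just the model's cross-entropy on the text: an ideal entropy coder spends -log2(p) bits per symbol. The 0.76 below is a made-up average, picked only because it lands at about 0.4 bits/char:)

from math import log2

def ideal_bits(probs_of_actual):
    # probs_of_actual[i] = probability the model assigned to the symbol
    # that actually occurred at position i.
    return sum(-log2(p) for p in probs_of_actual)

print(ideal_bits([0.76] * 1000) / 1000)  # ~0.40 bits/char, ~95% vs. 8-bit text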
There's no sense, for example, in which deriving a prediction about the nature of reality from a novel scientific theory is 'compression'.
E.g., suppose we didn't know a planet existed, and we looked at orbital data. There's no sense in which compressing that data would indicate another planet existed.
It's a great source of confusion that people think AI/ML systems are 'predicting' novel distributions of observations (science), vs., novel observations of the same distribution (statistics).
It should be more obvious that the latter is just compression, since it's just taking a known distribution of data and replacing it with a derivative optimal value.
Science predicts novel distributions based on theories, ie., it says the world is other than we previously supposed.
Predictions of novel objects derived from scientific theories aren't quantitative data points.
A statistical model of orbits, without a theory of gravity, is less compressed when you assume more objects. Take all the apparent positions of objects in the sky, {(object, x1, x2, t),...}. Find a statistical model of each point at t+1, so y = (o, x1, x2, t+1). There is no sense in which you're deriving a new object in the sky from this statistical model -- it is only a compression of observable orbits.
When you say, "if you have the new planet", you're changing the data generating process (theory) to produce a new distribution of points {(o', x1', x2', t'), ...} that includes an unseen object. You're then comparing two data generating models (two theories) for their simplicity. You're not comparing the associative models.
Call the prior theory 8-planets, so 8P generates (x1, x2, t); and the new theory 9P, which generates (x1', x2', t').
You're then making a conditional error distribution when comparing two rival theories. The 9P theory will minimize this error.
But in no sense can the 9P theory be derived from the initial associative statistical distribution. You are, based on theory (science, knowledge, etc.), choosing to add a planet, vs., e.g., correcting for measurement error, modifying Newton's laws, changing the angle of the earth w.r.t. the solar system... or one of an infinite number of theories which all produce the same error minimization.
The sense of "prediction" that science uses (via Popper et al.) is deriving the existence of novel phenomena that do not follow from prior observable distributions.
You want a statistical model that produces a theory of gravity.
By the way, if you want to see how well gzip actually models language, take any gzipped file, flip a few bits, and unzip it. If it gives you a checksum error, ignore that. You might have to unzip in a streaming way so that it can't tell the checksum is wrong until it's already printed the wrong data that you want to see.
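(One way to try it, sketched in Python; zlib is fed the stream incrementally, so the corrupted output appears before the CRC at the end of the file is ever checked:)

import zlib

data = bytearray(open("pg11.txt.gz", "rb").read())  # any gzipped text file
data[len(data) // 2] ^= 0x10                        # flip one bit mid-stream

d = zlib.decompressobj(wbits=31)                    # 31 = expect gzip framing
out = bytearray()
try:
    for i in range(0, len(data), 4096):
        out += d.decompress(bytes(data[i : i + 4096]))
except zlib.error:
    pass  # the DEFLATE stream often dies outright shortly after the flip
print(out[-500:].decode("latin-1"))                 # garbage near the damage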
$ curl https://www.gutenberg.org/cache/epub/11/pg11.txt | bzip2 --best | wc
246 1183 48925
Also, ZSTD goes all the way up to `--ultra -22` plus `--long=31` (4GB window; irrelevant here since the file fits in the default 8MB anyway).
llama.cpp also has a link to the model
But sure, it's a constant factor, so if you compress enough data you can always ignore it.
But I do want to point out that almost everyone installs at least one multigigabyte file to decompress other files, and that is the OS.
This is just the gigascale version of that.
It’s easier to transmit a story than the entire sensory experience of something happening, so it saves us time and attention. Being able to convert complex tasks and ideas into short stories has likely served us well since the brain burns a lot of calories.
We need compression because internally cognition has a very "wide" set of inputs/outputs (basically meshed), but we only have a serial audio link with relatively narrow bandwidth, so the whole thing that allows us to even discuss this is basically a neat evolutionary hack.