Does this even make sense? Are the copyright laws so bad that a statement like this would actually be in NVIDIA’s favor?
Quoting the text which the FSF put at the top of that page:
"This paper is published as part of our call for community whitepapers on Copilot. The papers contain opinions with which the FSF may or may not agree, and any views expressed by the authors do not necessarily represent the Free Software Foundation. They were selected because we thought they advanced the discussion of important questions, and did so clearly."
So, they asked the community to share thoughts on this topic, and they're publishing interesting viewpoints that clearly advance the discussion, whether or not they end up agreeing with them. I do acknowledge that they paid $500 for each paper they published, which gives some validity to your use of the verb "commissioned", but that's a separate question from whether the FSF agrees with the conclusions. They certainly didn't choose a specific author or set of authors to write on a specific topic before the paper was written, which is what a commission usually involves. Even then, the commissioning organization doesn't always agree with the paper's conclusion, unless the commission isn't considered complete until the paper is revised to match the desired conclusion.
> You will notice that the FSF has not rushed out to file copyright infringement suits even though they probably have more reason to oppose LLMs trained on FOSS code than anyone else in the world.
This would be consistent with them agreeing with this paper's conclusion, sure. But that's not the only possibility it's consistent with.
It could alternatively be because they discovered or reasonably should have discovered the copyright infringement less than three years ago, therefore still have time remaining in their statute of limitations, and are taking their time to make sure they file the best possible legal complaint in the most favorable available venue.
Or it could simply be because they don't think they can afford the legal and PR fight that would likely result.
It is impossible to tell how much AI any creator used secretly, so now all works are under suspicion. If copyright maximalists successfully copyright style (vibes), then creativity will be threatened. If they don't succeed, then copyright protection will be meaningless. A catch-22.
It makes some sense, yeah. There's also precedent in Google scanning massive amounts of books but not reproducing them. Most of our current copyright laws deal with reproductions. That's a no-no. It gets murky on the rest. NVIDIA's argument here is that they're not reproducing the works, they're not providing the works for other people, they're "scanning the books and computing some statistics over the entire set". Kinda similar to Google. Kinda not.
I don't see how they get around "procuring them" from dubious third-party sources, but oh well. The only certain thing is that our current laws didn't cover this, and it's probably too late now.
As a consumer you are unlikely to be targeted for such "end-user" infringement, but that doesn't mean it's not infringement.
This is the conclusion of the Authors Guild v. Google saga. The ruling goes through a lot of factors, but in the end the conclusion is this:
> In sum, we conclude that: (1) Google’s unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses. The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google’s commercial nature and profit motivation do not justify denial of fair use. (2) Google’s provision of digitized copies to the libraries that supplied the books, on the understanding that the libraries will use the copies in a manner consistent with the copyright law, also does not constitute infringement. Nor, on this record, is Google a contributory infringer.
Yeah, isn't this what Anthropic was found guilty of?
Our copyright laws are nowhere near detailed enough to settle anything here, so there is indeed a logical and technical inconsistency.
I can definitely see these laws evolving into things that are human-centric. It’s permissible for a human to do something but not for an AI.
What is consistent is that obtaining the books was probably illegal, but if, say, NVIDIA bought one Kindle copy of each book from Amazon and scraped everything for training, then that falls into the grey zone.
Perhaps, but reproducing the book from this memory could very well be illegal.
And these models are all about production.
Most of the best fit curve runs along a path that doesn’t even touch an actual data point.
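To make the curve-fitting analogy concrete, here is a minimal sketch of an ordinary least-squares line fit. The five data points are made up for illustration (they are not from anything discussed in this thread), and a straight line is only a stand-in for what a model learns, not a claim about how LLM training works.

```python
# Toy illustration: a least-squares line need not pass through any data point.
# The data below is hypothetical, chosen only to make the point visible.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 0.0, 3.0, 2.0, 5.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# Vertical distance from each data point to the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Number of data points the line actually passes through (within tolerance).
points_on_line = sum(1 for r in residuals if abs(r) < 1e-9)

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
print("points on the line:", points_on_line)
```

For this particular data the fitted line misses every point. A different data set could of course land some points exactly on the line, which is roughly where the memorization argument comes in.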
Yes, and that's stupid, and will need to be changed.
These academics were able to get multiple LLMs to produce large amounts of text from Harry Potter:
So the illegality rests at the point of output and not at the point of input.
I’m just speaking in terms of the technical interpretation of what’s in place. My personal views on what it should be are another topic.
It's not as simple as that, as this settlement shows [1].
Also, generating output is what these models are primarily trained for.
Yes but not generating illegal output. These models were trained with intent to generate legal output. The fact that it can generate illegal output is a side effect. That's my point.
If you use AI to generate illegal output, that act is illegal. If you use AI to generate legal output, that act is not illegal. Thus the point of output is where the legal question lies. From inception up to training there is clear legal precedent for the existence of AI models.
A type of wishful thinking fallacy.
In law scale matters. It's legal for you to possess a single joint. It's not legal to possess 400 tons of weed in a warehouse.
Scale is only used for emergence. OpenAI found that training transformers on the entire internet would make the model more than just a next-token predictor, and that is the intent everyone is going for when building these things.
No wishful thinking here.
I'm not sure you understood what you said, but superficially it appears that you are agreeing with me?
Just because it's legal to read 100s of books does not make it legal to slurp up every single piece of produced content ever recorded.
We're talking many, many orders of magnitude in scale there, and you're the one who pointed out that scale :-/
And now AI has killed his day job writing legal summaries. So they took his words without a license and used them to put him out of a job.
Really rubs in that “shit on the little guy” vibe.
The government is in full support of this "lending" concept, in fact they have created entire facilities devoted to this very concept of lending out books.
Everything else will be slurped up for and with AI and be reused.
(The difference is that the first use allows ordinary people to get smarter, while the second use allows rich people to get (seemingly) richer, a much more important thing)
I assume you're expecting that they'll reach out and cut a deal with each publishing house separately, and then those publishing houses will have to somehow transfer their data over to NVIDIA. But that's a very custom set of discussions and deals that have to be struck.
I think they're going to the pirate libraries because the product they want doesn't exist.
I keep hearing how it's fine because synthetic data, new techniques, feedback, etc. will solve it all. Then why do this?
The promises are not matching the resources available and this makes it blatantly clear.
• Anna’s Archive: ~61.7 million “books” (plus ~95.7M papers) as of January 2026 https://en.wikipedia.org/wiki/Anna%27s_Archive
• Amazon Kindle: “over 6 million titles” as of March 2018 https://en.wikipedia.org/wiki/Anna%27s_Archive
Hard to compare because AA contains duplicates, and the Kindle number is old, but at a glance it seems AA wins.
This is analogous to the difference between Gmail using search within your mail content to find messages that you are looking for vs Gmail providing ads inside Gmail based on the content of your email (which they don't do).
And yeah, you're most likely right about the first, and the contract writers have with Amazon most certainly anticipates this and includes both uses. But! I've never published on Amazon, so I don't know. I'm guessing they already have the rights to do so with what people have been uploading these last few years.
And yeah, they should be sued into the next century for copyright infringement. A $4 trillion company illegally downloading the entire corpus of published literature for reuse is clearly infringement. It's an absurdity to say that it’s fair use just to look for statistical correlations when training LLMs that will be used to render human authors worthless. One or two books is fair use. Every single book published is not.
> NVIDIA is also developing its own models, including NeMo, Retro-48B, InstructRetro, and Megatron. These are trained using their own hardware and with help from large text libraries, much like other tech giants do.
You can download the models here: https://huggingface.co/nvidia
It's basically just a sales demonstrator that, if incredibly successful and costly, they can still sell as SaaS, and if not, just offer for free.
Think of it as a tech ad.