The core story seems to be: Westlaw writes and owns headnotes that help lawyers find legal cases about a particular topic. Ross paid people to translate those headnotes into new text, trained an AI on the translations, and used those to make a model that helps lawyers find legal cases about a particular topic. In that specific instance, the court says this plan isn't fair use. If it were fair use, one could presumably just pay people to translate headnotes directly and make a Westlaw competitor, since translating headnotes is cheaper than writing new ones. And conversely, if it isn't fair use, where's the harm (the court notes, for example, that no copyright violation was necessary for interoperability)? One can still pay people to write fresh headnotes from caselaw and create the same training set.
The court emphasizes "Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today." But I'm not sure "generative" is that meaningful a distinction here.
You can definitely see how AI companies will be hustling to distinguish this from "we trained on copyrighted documents, and made a general purpose AI, and then people paid to use our AI to compete with the people who owned the documents." It's not quite the same, the connection is less direct, but it's not totally different.
One aspect is the court’s ruling that West’s headnotes are copyrightable even when they merely quote a court opinion verbatim, because the editorial decision to quote the material itself shows a “creative spark”. It really isn’t workable, in law specifically, for copyright to attach to the mere selection of a quote from a case to represent that case’s holding on an issue. After all, we would expect many lawyers analyzing the case independently to converge on the same quotes!
The key fact underlying all of this, I think, is that when Ross paid human annotators to write their own versions of the headnotes, they really did crib from West’s wholesale rather than doing their own independent analysis. Source text was paraphrased using curiously similar language to West’s paraphrasing. That, plus the fact that Ross was a directly competing product, is what I see as really driving this decision.
The case has very little to say about the more commonly posed question of whether copyright is infringed in large-scale language modeling.
The "competing product" thing is probably the most extreme part of this opinion.
The most important fair use factor is whether the use competes with the original work, but this generally means direct competition: if you translate someone else's book from English to French and want to sell the translation, the translation is going to be in direct competition for sales to people who speak both English and French. The customer is going to use the copy claiming fair use as a direct substitute for the original work, instead of buying it.
This court is trying to extend that to anything downstream, which seems crazy. For example, "multiple copies for classroom use" is one of the explicit examples of fair use in the copyright statute, but schools are obviously teaching people who intend to go into competition with the original author. In general, the idea that you can't read something if you ever intend to write something to sell in competition with it seems absurd and contradicts common practice in reverse engineering.
But this is also a district court opinion that isn't even binding on other courts, so we'll see what happens if it gets appealed.
The idea that the schools are encouraging the students to compete with the original authors of works taught in the classroom is fanciful under the meaning that courts usually apply to competition. Your example is different from this case, in which Ross wanted to compete in the same market against West by offering a similar service at a lower price. Another reason that the schools get a carveout is because it would make most education impractical without each school obtaining special licenses for public performance for every work referenced in the classroom.
But maybe that also raises the question of whether schools really deserve that kind of sweetheart treatment (a massive indirect subsidy), or whether it over-privileges formal schools relative to the commons at large?
It's written into the statute as an example of something that would be fair use.
> The idea that the schools are encouraging the students to compete with the original authors of works taught in the classroom is fanciful under the meaning that courts usually apply to competition.
People go to art school primarily because they want to create art. People study computer science primarily because they want to write code. It's their direct intention and purpose to compete with existing works.
> Your example is different from this case, in which Ross wanted to compete in the same market against West by offering a similar service at a lower price.
So if you use Windows and then want to create Linux...
> Another reason that the schools get a carveout is because it would make most education impractical without each school obtaining special licenses for public performance for every work referenced in the classroom.
How is that logic any different than for AI training?
> But maybe that also raises the question of whether schools really deserve that kind of sweetheart treatment (a massive indirect subsidy), or whether it over-privileges formal schools relative to the commons at large?
It not only doesn't have any explicit requirement for a formal school (it just says "teaching"), it also isn't limited to teaching; teaching is just one of the things specified in the statute as the kind of thing Congress intended fair use to include.
Statutory text controls what the courts can do, even and perhaps especially when it includes an example.
>People go to art school primarily because they want to create art. People study computer science primarily because they want to write code. It's their direct intention and purpose to compete with existing works.
Interesting perspective.
>So if you use Windows and then want to create Linux...
I don't understand your meaning.
>How is that logic any different than for AI training?
That is what Mark Lemley, law professor at Stanford, has argued in his many law review articles and amicus briefs: he believes that training is analogous to learning. The court here didn't agree with the Lemley view.
>It not only doesn't have any explicit requirement for a formal school (it just says "teaching"), it also isn't limited to teaching; teaching is just one of the things specified in the statute as the kind of thing Congress intended fair use to include.
In practice courts tend to limit these exceptions to formal teaching arrangements.
If you wrote a program that automatically rephrased an original text (something like the Encyclopaedia Britannica) to preserve the meaning without identical phrasing, and then sold access to that information in a way that undercut the original, then in my view that's clearly ripping off the original creators of the encyclopedia, and allowing it would likely stop people from writing new versions of the encyclopedia in the future.
These laws are there to make sure that valuable activities continue to happen and are not stopped because of theft. We need textbooks, we need journalistic articles, and producing these requires people to be paid to work on them.
I think it's entirely reasonable to say that an LLM is such a program, and if it is used on sources that are sustained by paying people to work on them, and the reformatted content is then sold in a way that undercuts the original activity, then that's a theft that clearly damages society.
I see LLMs as simply a different way to access the underlying content, and the rules of the underlying content should still apply. ChatGPT's revenues are predicted to be in the billions this year; sending some of that to content creators, so that content continues to be produced, is not just right, it's in their interest.
Note that it's very hard to do this starting from a single source, because in order to be safe from any copyright concern you'd have to preserve only the bare "idea" and everything else in your text must be independent. But LLMs seem to be able to get around this by looking at many sources that are all talking about the same facts and ideas in very different ways, and then successfully generalizing "out of sample" to a different expression of the same ideas.
I.e., the way to avoid copyright is to double down on the copying?
I can see how, for a human, you could argue that there is creativity in splicing those bits together into a good whole. But if that process is automated, is it still creative, or just automated theft?
That is the opposite of the ruling. The judge said the ones that summarize and pick out the important parts are copyrightable, and specifically excluded the headnotes that quote the court opinion verbatim.
The judge:
"But I am still not granting summary judgment on any headnotes that are verbatim copies of the case opinion (for reasons that I explain below)"
> More than that, each headnote is an individual, copyrightable work. That became clear to me once I analogized the lawyer’s editorial judgment to that of a sculptor. A block of raw marble, like a judicial opinion, is not copyrightable. Yet a sculptor creates a sculpture by choosing what to cut away and what to leave in place. That sculpture is copyrightable. 17 U.S.C. §102(a)(5). So too, even a headnote taken verbatim from an opinion is a carefully chosen fraction of the whole. Identifying which words matter and chiseling away the surrounding mass expresses the editor’s idea about what the important point of law from the opinion is. That editorial expression has enough “creative spark” to be original. ... So all headnotes, even any that quote judicial opinions verbatim, have original value as individual works.
I personally don't think this sculpture metaphor works for verbatim quotes from judicial opinions.
The marble from which a sculpture is carved is not itself a copyrighted work, and if we imagine it as having copyright protection, to the extent it's recognizable after editorial expression it'd have to qualify as fair use itself.
It's not ludicrous at all. Whether a work of "selection" from an existing source can be copyrightable in its own right would probably have to be judged on pretty much a case-by-case basis, but even in the context of "selecting" from a ruling there are almost certainly many cases where that work is creative and original enough that it can sensibly be protected by copyright.
I guess it depends on how long the source is, and how long the collection of quotes is, if we’d expect multiple lawyers to converge on the same solution. I don’t think it is totally obvious, though…
I’m also not sure if that’s a generally good test. It seems great for, like, painting. But I wouldn’t be surprised if we could come up with a photography scene where most professionals would converge on the same shot…
You could argue that all the words are already in the dictionary, so none of them are new; you are just quoting from the dictionary in a particular order...
The reason you have people, rather than computers, interpreting the law is that you can make judgements that make sense. Fundamentally these laws are there to protect work from being unfairly ripped off.
What was clearly done in this case was a rip-off which damaged the original creator - everything else is dancing on the head of a pin.
The detail of how to do that in a fair way that doesn't block other people is complex[1]; you can never cover all possibilities in a written law, which is why you have people interpreting them and making judgements. All I'm saying is that the guiding light in that interpretation is that copyright is there to protect the justifiable work of people in a fair way.
Somebody taking those law notes and trivially copying them to directly compete is clearly not 'fair use'.
If those notes could have been created mechanically directly from the original source - why didn't the copier do that - rather than use the competitors work?
[1] given the endless creativity of humans to game systems.
..."to promote the progress of science and useful arts". I don't see anything in there about rewarding 'work' irrespective of whether that work involves any kind of creativity.
> If those notes could have been created mechanically directly from the original source - why didn't the copier do that
That's actually a very good question. In practice, I do absolutely agree that the notes involve plenty of originality and creativity.
Not sure where you got that quote from, but I'd say the work aspect is implicit in "promote the progress", i.e. progress requires that people are able to get paid for their work to advance science or the useful arts.
If the progress was trivial and required no work then it wouldn't need protection or promotion.
And sure, it's phrased that way to strike the balance between fair use and protection, but if there were no need for protection then copyright wouldn't need to exist, as free reuse is the default.
Have you seen different? I’m curious what area of law you practice and in what state, for comparison’s sake.
> The federal system contemplates that individual states may adopt distinct policies to protect their own residents and generally may apply those policies to businesses that choose to conduct business within that state.
And the opinion reads:
> [T]he federal system contemplates that individual states may adopt distinct policies to protect their own residents and generally may apply those policies to businesses that choose to conduct business within that state.
... so it follows that it was then Ross's annotators showing the creative spark
Ross’s use is not transformative. Transformativeness is about the purpose of the use. “If an original work and a secondary use share the same or highly similar purposes, and the second use is of a commercial nature, the first factor is likely to weigh against fair use, absent some other justification for copying.” Warhol, 598 U.S. at 532–33. It weighs against fair use here. Ross’s use is not transformative because it does not have a “further purpose or different character” from Thomson Reuters’s. Id. at 529.
Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw. It is undisputed that Ross’s AI is not generative AI (AI that writes new content itself). Rather, when a user enters a legal question, Ross spits back relevant judicial opinions that have already been written. D.I. 723 at 5. That process resembles how Westlaw uses headnotes and key numbers to return a list of cases with fitting headnotes.
I think it's quite relevant that this was not generative AI: the reason that mattered is that "transformative" use biases towards Fair Use exemptions from copyright. However, this wasn't creating new content or giving people a new way to understand the data: it was just used in a search engine, much like Westlaw provided a legal search engine. The judge is pointing out that the exact implementation details of a search engine don't grant Fair Use.
This doesn't make a ruling about generative AI, but I think it's a pretty meaningful distinction: writing new content seems much more "transformative" (in a literal sense: the old content is being used to create new content) than simply writing a similar search engine, albeit one with a better search algorithm.
They were doing semantic search using embeddings/rerankers.
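For anyone unfamiliar, that kind of retrieval pipeline is roughly the following shape. This is only a minimal sketch: embed() stands in for whatever embedding model is used, and nothing here reflects Ross's actual system.

    # Rough sketch of embedding-based ("semantic") search over judicial opinions.
    # embed() is a placeholder; a cross-encoder reranker would typically
    # re-score the top hits afterwards.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Placeholder: return an embedding vector for the text from some model.
        raise NotImplementedError("plug in a real embedding model here")

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def search(question: str, opinions: list[str], top_k: int = 5) -> list[str]:
        # Rank stored opinions by similarity to the legal question.
        q = embed(question)
        ranked = sorted(opinions, key=lambda doc: cosine(q, embed(doc)), reverse=True)
        return ranked[:top_k]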
The point that emerges from reading both decisions together is that if they had trained a model on the Bulk Memos and generated novel text instead of doing direct searches, there likely would have been enough indirection to prevent summary judgment, and this would have gone to a jury, as the September decision states.
In other words, from their comment:
> But I'm not sure "generative" is that meaningful a distinction here.
The judge would not seem to agree at all.
Westlaw protects them because they are the "value add." Otherwise their business model is "take published decisions the court is legally bound to provide for free and sell it to you."
An LLM today could easily recreate the headnotes from scratch, in a far superior manner, with the right prompt. I don't even think hallucinations would factor in on such a small, well-constrained task, but you can always just asterisk the headnotes and put a disclaimer on them.
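As a rough illustration of what that might look like (a sketch assuming the OpenAI Python client; the model name and prompt are illustrative, not anything Westlaw or Ross actually uses):

    # Hypothetical sketch: asking an LLM to draft a headnote-style summary.
    # Model name and prompt are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def draft_headnote(opinion_excerpt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative choice
            messages=[
                {"role": "system",
                 "content": "Summarize one point of law from the excerpt "
                            "in one or two sentences, headnote style."},
                {"role": "user", "content": opinion_excerpt},
            ],
        )
        return response.choices[0].message.content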
I always thought they were obviously copyrightable. Plus they’re not close to perfect either.
Surely creating a general-purpose AI is transformative, though? Are you anticipating that AI companies will be sued for contributory infringement, because customers are using a general-purpose AI to compete with companies which created parts of the training data?
The judge does note that no copyrighted material was distributed to users, because the AI doesn't output that information:
> There is no factual dispute: Ross’s output to an end user does not include a West headnote. What matters is not “the amount and substantiality of the portion used in making a copy, but rather the amount and substantiality of what is thereby made accessible to a public for which it may serve as a competing substitute.” Authors Guild, 804 F.3d at 222 (internal quotation marks omitted). Because Ross did not make West headnotes available to the public, Ross benefits from factor three.
But he only does so as part of an analysis of whether there's a valid fair use defense for Ross's copying of the headnotes, ignoring the obvious (to me) question: if no copyrighted material was distributed to end users, how can this even be a violation of copyright in the first place?
Obscurity ≠ legal compliance.
This is a good distillation. A bit like "we trained our system on various works of art and music, and now it is being sold as a service that competes with the original artists and musicians."
If it would be illegal for a group of people to do something, it is also going to be illegal for an AI to do so.
Why is that so surprising?
This effectively kills open source, which can't afford to license and won't be able to sublicense training data.
This is very bad for democratized access to and development of AI.
The giants will probably want this. The giants were already purchasing legacy media content enterprises (Amazon and MGM, etc.), so this will probably further consolidation and create extreme barriers to entry.
If I were OpenAI, I'd probably be very happy right now. If I were a recent batch YC AI company, I'd be mortified.
To the contrary, this just means companies can't make money from these models.
Those using models for research and personal use wouldn't be infringing under the fair use tests.
Maybe the strategy is something like this:
1) Survive long enough/get enough users that killing the generative AI industry is politically infeasible.
2) Negotiate a compromise similar to the compulsory mechanical royalty system used in the music business to “compensate” the rights holders whose content is used to train the models
The biggest AI companies could even run the enforcement cartels ala BMI/ASCAP to compute and collect royalties owed.
If you take this to its logical conclusion, the AI companies wouldn’t have to pre-license anything, and would just pay out all the royalties to the biggest rights holders (more or less what happens in the music industry) on the basis that figuring out what IP went into what model output is just too hard, so instead they just agree to distribute it to whomever is on the New York Times best seller list at any given moment.
the long tail exists, and there will always be a threshold for payments due to rights holders.
it used to be (like 10 years ago so i might not remember the details exactly) that if you earned less than £1 from youtube performing music rights in a quarter then any money you earned was put back into the pot and redistributed to those earning over £1.
it just wasn’t worth the cost to keep track of £0.00001 earnings for all the rights holders in the bottom of the long tail each quarter, or to pay the bank fees when they eventually earn £0.01 that can be paid to them.
definitely not perfect, but at least some people were getting paid, instead of none.
also, youtube’s data they gave us was fairly shit (video title, url). so that didn’t help. nor did the lack of compute/data proc infrastructure/skills. was historically a manual spreadsheet job trying to work out who to cut.
i had to do it a few times :/
edit —
> The biggest AI companies could even run the enforcement cartels ala BMI/ASCAP to compute and collect royalties owed.
what could happen, for music at least, is the same thing that happened with youtube, mashed up with live music analogies.
a licensing negotiation with BMI/ASCAP/PRS, and maybe major publishers directly if they get frustrated with the PROs. then PROs will use sampling of other revenue streams to work out what the likely popular things are for AI. then divvy up whatever the lump sum is between the most popular songs.
we used to do this for live music. i had to generate the sampled dataset in microsoft access each year and weed out the all the radio stings.
sorry for costing you a million pounds that one year ed sheeran :/
Check out this one cool trick companies found for skirting copyright restrictions.
Lawyers HATE them!
They don't need every copyrighted work and getting a fraction is entirely practical. They would go to some large conglomerate like Getty Images or large publishers or social media whose terms give the site a license to what you post and then the middle men would get a vig and the original authors would get peanuts if anything at all.
But in aggregate it would price out the little guy from creating a competing model, because each creator getting $3 is nothing to the creator but is real money to a small entity when there are a billion creators.
What is needed instead (I doubt politicians read HN, but someone go and tell them) is a new law that regulates training of these models if we want them to exist and be used in a legally safe way. This is needed for example because most jurisdictions have different copyright laws from one another, but software travels globally.
It would make sense to make all books available for non-commercial, perhaps even commercial R&D in AI, if society elected that to be beneficial, in the same way that publishers must donate one copy of each new work to a copyright library (the Library of Congress in the US, the Oxford and Cambridge university libraries and the British Library in the UK, the Frankfurt and Leipzig Nationalbibliotheken for Germany, etc.). Just add extra provisions that they need to send a plain text copy to the Linguistic Data Consortium (LDC), which manages datasets for NLP. As with fair use, there can be provisions to make up for that use that happen automatically in the background (in some countries the price of a photocopying machine includes a fee that gets passed on to copyright holders).
Otherwise you'll have one LLM being legal in one country but illegal in another because more than 15% of one book was in the training data, and other messy situations.
Oh no. Anyway.
I’m good with your proposal if we also revert to the original 14 year + 14 year extension model. As it stands the 120 year copyright is so ridiculously tilted that we should not allow it to extend to veto power over technical advancements.
Whether or not OpenAI is found to be breaking the law will be utterly irrelevant to actual open AI efforts.
No, they won't. The biggest models want to train on literally every piece of human-written text ever written. You can pay to license small subsets of that at a time. You can't pay to license all of it. And some of it won't be available to license at all, at any price.
If the copyright holders win, model trainers will have to pay attention to what they train on, rather than blithely ignoring licenses.
They genuinely don't. There is a LOT of garbage text out there that they don't want. They want to train on every high quality piece of human-written text they can get their hands on (where the definition of "high quality" is a major piece of the secret sauce that makes some LLMs better than others), but that doesn't mean every piece of human-written text.
OpenAI is Uber with a slightly less ethically despicable CEO.
It knows it's flouting the spirit of copyright law -- it's just hoping it could bootstrap quickly enough to make the question irrelevant.
If every commercial AI company that couldn't prove training data provenance tomorrow was bankrupted, I wouldn't shed an ethical tear. Live by the sword, die by the sword.
For me (Italian) this is amazing! Most Italian judges and lawyers write in a purposely obscure fashion, as if they wanted to keep the plebs away from their holy secrets. This document instead begs to be read; some parts are more in the style of a novel than of a technical document.
Also, when the judge makes that statement, it looks like he misunderstands the nature of the AI system and the inherent generative elements it includes.
For example, a classifier is a generative model if it models p(example, label) -- which is sufficient to also calculate p(label | example) if you want -- rather than just modeling p(label | example) alone.
Similar example in translation: a generative translation model would model p(french sentence, english sentence) -- implicitly including a language model of p(french) and p(english) in addition to allowing translation p(english | french) and p(french | english). A non-generative translation model would, for instance, only model p(french | english).
I don't exactly understand what this judge meant by "generative", it's presumably not the technical term.
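To make that distinction concrete, here is a toy sketch (purely illustrative, not about any system in this case) of how a joint model p(example, label) also yields p(label | example), while a discriminative model stores only the conditional directly:

    # Toy joint distribution p(example, label) over a two-feature, two-label problem.
    joint = {
        ("short", "spam"): 0.10,
        ("short", "ham"):  0.30,
        ("long",  "spam"): 0.40,
        ("long",  "ham"):  0.20,
    }

    def p_label_given_example(example: str, label: str) -> float:
        # Conditional p(label | example) derived from the joint via Bayes' rule.
        p_example = sum(p for (ex, _), p in joint.items() if ex == example)
        return joint[(example, label)] / p_example

    print(p_label_given_example("long", "spam"))  # 0.4 / 0.6 = 0.667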
However, and annoyingly so, recently the general public and some experts have been speaking of "generative AI" (or GenAI for short) when they talk about large language models.
This creates the following contradiction:
- large language models are called "generative AI"
- large language models are based on transformers, which are neural networks
- neural networks are discriminative models (not generative ones like Hidden Markov Models)
- discriminative models are the opposite of generative models, mathematically
So we may say "Generative AI is based on discriminative (not generative) classifiers and regressors". [as I am also a linguist, I regret this usage came into being, but in linguistics you describe how language is used, not how it should be used in a hypothetical world.]
References
- Gen AI (Wikipedia) https://en.wikipedia.org/wiki/Generative_artificial_intellig...
- Discriminative (Conditional) Model (Wikipedia) https://en.wikipedia.org/wiki/Discriminative_model
By your definition, basically every classifier with 2 inputs would be generative. If I have a classifier for the MNIST dataset and my inputs are the pixels of the image, does that make the classifier generative because the inputs aren’t independent from each other?
But many other types of model would give you a joint distribution P(which digit, all pixels), so would be generative. Even if you only used it for classification.
https://en.wikipedia.org/wiki/Generative_model
I guess these days "generative" must mean "it is used to generate outputs that look like the training data".
But until recently, the meaning had to do with the information in the model, not how it's used.
Yep. That's what people have been saying all along. If the intent is to substitute the original, then copying is not fair use.
But the problem is that the current method for training requires this volume of data. So the models are legitimately not viable without massive copyright infringement.
It'll be interesting to see how a defendant with a larger wallet will fare. But this doesn't look good.
Though big-picture, it seems to me that the moneyed interests will ensure that even if the current legal landscape doesn't allow LLMs to exist, they will lobby HARD until it is allowed. This is inevitable now that it's at least partially framed in national security terms.
But I'd hope that this means there is a chance that if models have to train on all of human content, the weights will be available for free to all humans. If it requires massive copyright infringement on our content, we should all have an ownership stake in the resulting models.
Sure it is. It just requires what every other copyrighted work needs: permission and stipulations from the copyright holder. These aren't small-time bloggers on the internet, these are large scale businesses.
>Though big-picture, it seems to me that the moneyed interests will ensure that even if the current legal landscape doesn't allow LLMs to exist, they will lobby HARD until it is allowed.
The only solace I take is that these conglomerates are paying a lot to take down the rules they made 30 years ago when they weren't the ones profiting from stealing. But yes, I'm still frustrated by the hypocrisy.
Most other scenarios don't use millions/billions of works - that's the part which puts viability in question.
> these are large scale businesses.
I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses. Large-scale pretraining is common even for models that are not cutting-edge LLMs.
> The only solace I take is that these conglomerates are paying a lot to take down the rules they made 30 years ago when they weren't the ones profiting from stealing
As far as I'm aware, most of the lobbying in favor of stricter copyright has been done by Disney, Universal, Time Warner, RIAA, etc.
Not to say that tech companies have a consistent moral stance beyond whatever's currently in their financial self-interest, but I think that self-interest has put them in a position of supporting fair use and copyright safe harbors, opposing link tax, etc. more often than the other way around - with cases like Authors Guild v. Google being a significant win for fair use.
Yes, they do. We have acquisitions in the billions these days and exclusivity deals in the hundreds of millions. Let's not pretend these companies can't do this through normal channels. They just wanna steal because they think they can get away with it.
>I'd like training models to also remain accessible to open-source developers, academic researchers, and smaller businesses.
Same. But such models still need to be ethically sourced. Maybe there's not enough royalty free content to compete with OpenAI, but it's pretty clear from Deepseek that you don't need 82 TB of data to be effective. If we need that much data, there are clearly optimizations to be made.
>I think that self-interest has put them in a position of supporting fair use and copyright safe harbors,
Yet they will sue any time their data is scraped or they're otherwise not making money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share of using copyright. Microsoft won a lawsuit against web scraping via LinkedIn less than a year before OpenAI fell into legal troubles over scraping the entire internet.
To clarify: veggieroll said training models wouldn't be viable, you said it'd just require licensing like everyone else already manages, I said most other cases don't use millions/billions of works, you're saying that yes they do?
I feel like there must be a misunderstanding here, because that doesn't make much sense to me. Even for making a movie, which I think would be the most onerous of traditional cases, the number of works you'd license would likely be in the dozens (couple of pop songs, some stock images, etc.) - not billions.
> Let's not pretend these companies can't do this through normal channels
I'm not sure that there really has been a normal channel for licencing at the scale of "almost everything on the public Internet". A compulsory licensing scheme, like the US has for cover songs, could make it feasible to pay into a pot - but again I'd really hope for model training to remain accessible to smaller players opposed to just "meh, OpenAI has billions".
> but it's pretty clear from Deepseek that you don't need 82 TB of data to be effective.
As far as I'm aware, DeepSeek is not a low-data model. In fact, given China's more lax approach to copyright, I would not be surprised if the ability to freely pass around shadow libraries and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts.
> If we need that much data, there are clearly optimizations to be made.
I don't think this is necessarily a given - humans evolved on ~4 billion years worth of data, after all.
> Yet they will sue any time their data is scraped or they're otherwise not making money. Maybe they didn't put trillions into lobbying like others, but they definitely have their fair share of using copyright.
I believe lawsuits launched by or fuss kicked up by model developers will typically be on a contract basis (i.e "you agreed to our ToS then broke it") rather than a copyright basis. Again not to say these tech companies are acting in any way except their own self-interest, just that they've generally been more pro-fair-use than pro-strict-copyright on average to my knowledge.
I assumed we were talking about logistics, not tech. I'm sure it will be technically possible to use less training data over time (Deepseek is more or less demonstrating that in real time. Maybe there's copyright data but I'd be surprised if it used anything close to 80 TB like competitors).
I know hindsight is 20/20, but I always felt the earlier approaches were absurdly brute forced.
>I'm not sure that there really has been a normal channel for licencing at the scale of "almost everything on the public Internet"
There isn't. So they'd need to do it the old fashioned way with agreements. Or make some incentive model that has media submit their works with that understanding of training. Or any number of marketing ideas.
I don't exactly pity their herculean effort. Those same companies spent decades suing individuals for much pettier uses and building that precedent up (some covered under fair use).
>and large archives of torrented data without lawsuits was one of the factors contributing to their fast success relative to western counterparts.
And now they're being slowed down, if not litigated out of the market. Public trust in AI is falling. The lack of oversight into hallucinations may have even cost a few lives. Content creators now need to take extra precautions so they aren't stolen from because they don't even bother trying to respect robots.txt. Even a few posts here on HN note how the scraping is so rampant that it can spike their hosting costs on websites (so now we need more captchas. And I hate myself for uttering such a sentence).
Was all that velocity worth it? Who benefitted from this outside of a few billionaires? We can't even say we beat China on this.
>I don't think this is necessarily a given - humans evolved on ~4 billion years worth of data, after all
Humans inherit their data and slowly structure around that. Maybe if AI models collaborated together as humanity did, I would sympathize more with this argument.
We both know it's instead a rat race and the goal isn't survival and passing on knowledge (and genes) to the next generation. AI can evolve organically but it instead devolved into a thieves' den.
I take a view more like Bell's spaceship paradox: if they started gaining data ethically, by the time they gathered a decent chunk they probably would have already optimized a model that needs less data. It'd be slower, but not actually much slower in the long run. But they aren't exactly trying to go for quality here.
>I believe lawsuits launched by or fuss kicked up by model developers will typically be on a contract basis (i.e "you agreed to our ToS then broke it") rather than a copyright basis.
I suppose we'll see. Too early to tell. This lawsuit will definitely be precedent in other ongoing cases, but others may shift to a copyright infringement case anyway. Unlike other llms there was some human tailoring going on here, so it's not fully comparable to something like the NYT case.
Still uncertain what you mean - the logistics of creating something? Logistics as in transporting goods? Either way I think veggieroll's point on viability still stands.
> Deepseek is more or less demonstrating that in real time. Maybe there's copyright data but I'd be surprised if it used anything close to 80 TB like competitors
* GPT-4 is reported to have been trained on 13 trillion tokens total - which is counting two passes over a dataset of 6 trillion tokens[0]
* DeepSeek-V3, the previous model that DeepSeek-R1 was fine-tuned from, is reported to have been pre-trained on a dataset of 14.8 trillion tokens[1]
Can't find any licensing deals DeepSeek have made, so vast majority of that will almost certainly be unlicensed data - possibly from CommonCrawl and shadow libraries.
[0]: https://patmcguinness.substack.com/p/gpt-4-details-revealed
[1]: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSee...
> > > Let's not pretend these companies can't do this through normal channels.
> > I'm not sure that there really has been a normal channel [...]
> There isn't.
Then, surely it's not just pretending?
A while back, as a side project, I'd had a go at making a tool to describe photos for visually impaired users. I contacted Getty to see if I could license images for model training, and was told directly that they don't license images for machine learning. Particularly given that I'm not a massive company, I just don't think there really are any viable paths at the moment except for using web-scraped datasets.
> So they'd need to do it the old fashioned way with agreements.
I'm sceptical of whether even the largest companies would be able to get sufficient data for pre-training models like LLMs from only explicit licensing agreements.
> I don't exactly pity their herculean effort. Those same companies spent decades suing individuals for much pettier uses and building that precedent up (some covered under fair use).
I feel you're conflating two groups: model developers that have previously been (on average) supportive of fair-use, and media companies (such as the ones currently launching lawsuits against model training) that lobbied for stronger copyright law. Both are acting in self-interest, but I'd disagree with the idea that there was any significant switching of sides on the topic of copyright.
> Content creators now need to take extra precautions so they aren't stolen from because they don't even bother trying to respect robots.txt.
The major US players claim to respect robots.txt[2][3][4], as does CommonCrawl[5] which is what the smaller players are likely to use.
You can verify that CommonCrawl respects robots.txt by downloading it yourself and checking (a rough sketch of that kind of check follows the links below).
If OpenAI/etc. are lying, it should be possible for essentially anyone hosting a website to prove it by showing access from one of the IPs they use for scraping[6]. (I say IPs rather than useragent string because anyone can set their useragent string to anything they want, and it's common for malicious/poorly-behaved actors to pretend to be a browser or more common bot).
[2]: https://platform.openai.com/docs/bots
[3]: https://support.anthropic.com/en/articles/8896518-does-anthr...
[4]: https://blog.google/technology/ai/an-update-on-web-publisher...
[5]: https://commoncrawl.org/faq
[6]: https://openai.com/gptbot.json
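Here is a minimal sketch of that kind of check using Python's standard library. The crawler tokens shown are the commonly published ones; this only reads the published policy, it cannot prove what any crawler actually does.

    # Check what a robots.txt-respecting crawler would be allowed to fetch.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # replace with your own site
    rp.read()

    for agent in ("GPTBot", "CCBot", "Google-Extended", "anthropic-ai"):
        allowed = rp.can_fetch(agent, "https://example.com/some-article")
        print(f"{agent}: {'allowed' if allowed else 'disallowed'}")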
> Was all that velocity worth it? Who benefitted from this outside of a few billionaires? We can't even say we beat China on this.
There's been a large range of beneficial uses for machine learning: language translation, video transcription, material/product defect detection, weather forecasting/early warning systems, OCR, spam filtering, protein folding, tumor segmentation, drug discovery and interaction prediction, etc.
I think this mainly comes back to my point that large-scale pretraining is not just for LLM chatbots. If you want to see the full impact, you can't just have tunnel-vision on the most currently-hyped product of the largest companies.
> Humans inherit their data and slowly structure around that. Maybe if AI models collaborated together as humanity did, I would sympathize more with this argument.
Machine learning in general (not "OpenAI") is a fairly open and collaborative field. Source code for training/testing is commonly available to use and improve; papers documenting algorithms, benchmarks, and experiments are freely available; arXiv (Cornell University's open-access preprint repository) is the place for AI papers, opposed to paywalled journals; and it's very common to fine-tune someone's existing pretrained model to perform a new task (transfer learning) opposed to training from scratch.
I'd attribute a lot of the field's success to building off each others' work in this way. In other industries, new concepts like transformers or low-rank-adaptation might still be languishing under a patent instead of having been integrated and improved on by countless other groups.
> AI can evolve organically but it instead devolved into a thieves' den.
Unclear what you mean by organically - evolution still needs data.
This is one of those things that signal how dumb this technology still is - or maybe how smart humans are when compared to machines. A human brain doesn't need anywhere close to this volume of data, in order to be able to produce good output.
I remember talking with friends 30 years ago about how it was inevitable that the brain would eventually be fully implemented as machine, once calculation power gets big enough; but it looks like we're still very far from that.
Maybe not directly, but consider that our brains are the product of millions of years of evolution and aren't a blank slate when we're born. Even though babies can't speak a language at birth, they already have all the neural connections in place in order to acquire and manipulate language, and require just a few years of "supervised fine tuning" to learn the actual language.
LLMs, on the other hand, start with their weights at random values and need to catch up with those millions of years of evolution first.
* Preprocessed, since the data is actually 1D streams of characters, and not 2D colour points (as with vision models).
A lot of what we're able to do has to be from some sort of generic capability.
> I remember talking with friends 30 years ago
I'd say you're pretty old. How many years of training did it take for you to start producing good output?
The lesson here is we're kind of meta-trained: our minds are primed to pick up new things quickly by abstracting them and relating them to things we already know. We work in concepts and mental models rather than text. LLMs are incredibly weak by comparison. They only understand token sequences.
There's enough money in the market to fund a lot of research into totally novel underlying methods. But if it takes too long, investors and lawmakers will just move to make what already works legal, because it is useful.
Why would it be?
"It's inevitable that the Burj Khalifa gets built, once steel production gets high enough."
"It's inevitable that Pegasuses will be bred from horses, as soon as somebody collects enough oats."
Reducing intelligence to the bulk aggregate of brute "calculation power" is... Ironically missing the point of intelligence.
Copyright is not about acquisition, it is about publication and/or distribution. If I get a copy of Harry Potter from a dumpster, I can read it. If a company gets a copy of *all* books from a torrent, they can use it to train their AI. The torrent providers may be in violation of copyright, and if the AI can be used to reproduce substantive portions of the original text, the AI companies then may be in violation of copyright, but simply training a model on illegally distributed text should not be copyright infringement.
You can train a model on copyrighted text, you just can't distribute the output in any way without violating copyright. (edit: depending on the other fair use factors).
One of the big problems is that training is a mechanical process, so there is a direct line between the copyrighted works and the model's output, regardless of the form of the output. Just on those terms it is very likely to be a copyright violation. Even if they don't reproduce substantive portions, what they do reproduce is a derived work.
I edited my post to make it a bit clearer.
Google making thumbnails or scanning books are both arguably "mechanical". Both have been ruled as fair use.
What if I’m a simulated brain running on a chip? What if I’m just a super-smart human and instead of reading and writing in the conventional way, I work out the LLM math in my head to generate the output?
Any examples of people being sued for merely downloading? "Torrenting" basically always involves uploading, even if you stop immediately after completion. A better test would be if someone was sued for using an illegal streaming site, which to my knowledge has never happened.
But that's not what anyone is doing. People train models so that someone can actually use them. So I'm not sure how your comment is helpful other than to point out that distinction (which doesn't make much difference in this case specifically, or for how copyright applies to LLMs in general).
Simply running my business on illegally distributed copyrighted text/software/movie should not be copyright infringement.
At least some current AI providers, however, come with terms of service that promise that they will cover any such legal disputes for you.
It would be interesting to see how this holds up in court.
"Your honor, I didn't watch the movie I downloaded, I only used it to train an AI."
I highly suspect it would not matter.
"a person reading" and "computer processing of data" (training) are not the same thing
MDY Industries, LLC v. Blizzard Entertainment, Inc. rendered the verdict that loading unlicensed copyrighted material from disk was "copying", and hence copyright infringement
Ross was trying to compete with Westlaw, but used Westlaw as an input. West's "Key Numbers" are, after a century and a half, a de-facto standard.[2] So Ross had to match that proprietary indexing system to compete. Their output had to match Westlaw's rather closely. That's the underlying problem. The court ruled that the objective was to directly compete with Westlaw, and using Westlaw's output to do that was intentional copyright infringement.
This looks like a narrow holding, not one that generally covers feeding content into AI training systems.
[1] https://apnews.com/article/google-france-news-publishers-cop...
If this was only about key numbers, it might have gone the other way because the fact-like element there is considerably greater.
What's funny is that any SOTA LLM today could definitely author them, and even LexisNexis advertises the fact: https://www.lexisnexis.com/community/insights/legal/b/produc...
It’s been interesting that media where watermarking has been feasible (like photography) have seen creators get access to some compensation, while text based creators get nothing.
This would have detrimental effects to people who use screen readers or have their own stylesheets of course.
It seems like that would be pretty easily defeatable with the similar mapping to the one used to do the replacement.
The fact that it took until 2024 for the case to resolve shows how long the wheels of justice can take to turn!
A criminal case, especially a death row case, can take 20+ years to exhaust every level of appellate review. In Illinois there are at least nine levels of review available to you without going through second rounds of review, state habeas, and collateral attacks like applications for clemency, pardons, etc. If you're not paying for lawyers, expect each level to take around two years or more.
To clarify, they spent decades litigating the same fundamental issue for each year’s tax filings, with each filing year taking multiple years to get to court. The plaintiffs won every single case until the government finally settled all the remaining tax years for that amount. Each year prior was worth hundreds of millions.
About judge Bibas: https://en.wikipedia.org/wiki/Stephanos_Bibas
So, in other words, it's good.
The reason it's valuable is that it's transcribed live (usually with video) and is accurate and verifiable. Words and names are spelled correctly and speakers are correctly identified. Court reporters will stop speakers and ask for spelling or to repeat words.
AI transcriptions can't do that.
I’m not sure this signals the end of AI and a victory for the human, but rather who gets to train the models?
Is this type of risk the reason why OpenAI masquerades as a non-profit?
I'm aware this isn't a concern yet, but imagine if the future played out this way....
Or worse: Only those with really deep pockets can pay to get AI, and no one else can, simply because they can't afford the copyright fees.
Only one of the many reasons the legal profession is so expensive.
It shouldn’t surprise the writer that the AI companies’ versions of fair use didn’t hold much weight. They should assume that would be true. Then, be surprised any time a pro-AI ruling goes against common examples in case law. The AI companies are hoping to achieve that by throwing enough money at the legal system.
"But a headnote can introduce creativity by distilling, synthesizing, or explaining part of an opinion, and thus be copyrightable."
Does this set a precedent, whereby AI-generated summaries are copyrightable by the LLM owners?
They would need to figure out a way to prune the respective weights so that such material is not available, or risk legal fury.
Youtube doesn't need to figure out how to stop copyright material from being uploaded, they need to stop it from being shared.
You want to reliably train it away from outputting the undesired outputs, not keeping it ignorant about them.
I wonder how the politics played out. The big AI companies could have funded Ross Intelligence, who could have threatened to sabotage their legal strategies by tanking and settling their own case in TR's favor.
Even before this ruling, Ross Intelligence had already felt the impact of the court battle: the startup shut down in 2021, citing the cost of litigation.
Lawyers are gonna be happy is my thought.
This is going to make Deepseek and its kin much more valuable.
Every AI company creating its own training data, resulting in AIs that are similar but not identical, is in my opinion much better than one or very few AIs.
This is going to be one of many cases in which there will be licensing deals being made out of this to stop AI grifters claiming 'fair use' to try to side-step copyright laws because they are using a gen AI system.
OpenAI ended up paying up for the data with Shutterstock and other news sources. This will be no different.
whoever wrote those indemnity policies is going to regret it
Didn't you already share it on GitHub royalty-free?
and other than that, All Rights Reserved
My willingness to upload my projects anywhere is in the historical lows given the current state, honestly.
Your post is totally fine; I just want to save space at the top of the thread (where the parent is now pinned).