> The spirit of the GPL is to promote the free sharing and development of software [...] the reality is that they are proceeding in a different vector from the direction of code sharing idealized by GPL. If only the theory of GPL propagation to models walks alone, in reality, only data exclusion and closing off to avoid litigation risks will progress, and there is a fear that it will not lead to the expansion of free software culture.
The spirit of the GPL is the freedom of the user, not the code being freely shared. The virality is a byproduct to ensure the software is not stolen from its users. If you just want your code to be shared and used without restrictions, use MIT or some other license.
> What is important is how to realize the “freedom of software,” which is the philosophy of open source
Freedom of software means nothing. Freedoms are for humans, not immaterial code. Users get the freedom to enjoy the software how they like. Washing the code through an AI to purge it of its license goes against the open source philosophy. (I know this may be a mistranslation, but it goes in the same direction as the rest of the article.)
I also don't agree with the arguments that since a lot of things are included in the model, the GPL code is only a small part of the whole, and that means it's okay. Well if I take 1 GPL function and include it in my project, no matter its size, I would have to license as GPL. Where is the line? Why would my software which only contains a single function not be fair use?
who do you mean by "user"?
the spirit is that the person who actually uses the software also has the freedom to modify it, and that the users receiving these modifications have the same rights.
is that what you meant?
and while technically that's the spirit of the GPL, the license is not only about users, but about a _relationship_, that of the user and the software and what the user is allowed to do with the software.
it thus makes sense to talk about "software freedom".
last but not least, about a single GPL function --- many GPL _libraries_ are licensed less restrictively, under the LGPL.
Like if there is no way to trace it back to the original material, does it make sense to regulate it? Not that I like the idea, just wondering.
I have been thinking for a while that LLMs are copyright-laundering machines, and I am not sure if there is anything we can do about it other than accepting that it fundamentally changes what copyright is. Should I keep open sourcing my code now that the licence doesn't matter anymore? Is it worth writing blog posts now that it will just feed the LLMs that people use? etc.
https://github.com/ocaml/ocaml/pull/14369/files#diff-062dbbe...
An inverse of this question is arguably even more relevant: how do you prove that the output of your model is not copyrighted (or otherwise encumbered) material?
In other words, even if your model was trained strictly on copyleft material but, properly prompted, outputs a copyrighted work, is it copyright infringement, and if so, by whom?
Do not limit your thoughts to text only. "Draw me a cartoon picture of an anthropomorphic with round black ears, red shorts and yellow boots". Does it matter if the training set was all copyleft if the final output is indistinguishable from a copyrighted character?
That's not legal use of the material according to most copyleft licenses, regardless of whether you end up trying to reproduce it. It's also quite immoral, even if technically-strictly-speaking-maybe-not-unlawful.
It may produce it when asked
https://chatgpt.com/share/678e3306-c188-8002-a26c-ac1f32fee4...
discovery via lawyers
It's much easier to do that for the data that was repeated many times across the dataset. Many pieces of GPL software are likely to fall under that.
Now, would that be enough to put the entire AI under GPL? I doubt it.
On the other side, I deeply believe in the values of free software. My general stance is that all applications I open source are GPL or AGPL, and any libraries I open source are MIT. For the libraries, obviously anyone is free to use them, and if they want to rewrite them with an LLM more power to them. For the applications though, I see that as a violation of the license.
At the end of the day, I have competing values and needs and have to make a choice. The choice I've made for now is that for the vast majority of things, I'm still open sourcing them. The gift to humanity and the guarantee of the users' freedom are more important to me than a theoretical threat. The one exception is anything that is truly at risk of getting lifted and used directly by competitors. I have not figured out an answer to this one yet, so for now I'm keeping it AGPL but not publicly distributing the code. I obviously still make the full code available to customers, and at least for now I've decided to trust my customers.
I think this is an issue we have to take week by week. I don't want to let fear of things cause us to make suboptimal decisions now. When there's an actual event that causes a reevaluation, I'll go from there.
The burden is on you to prove that you didn't.
Your LICENSE matters in similar ways to how it mattered before LLMs. LICENSE adherence is part of intellectual property law and practice. A popular engine may be popular, but not in all cases at all times. Do not despair!
Anything you produce will be consumed and regurgitated by the machine. It's a personal question for everyone whether you choose to keep providing grist for their mills.
I also have the feeling it will be much like Google LLC v. Oracle America, Inc.: much of this won't really be clearly resolved until the end of the decade. I'd also not be surprised if seemingly very different answers ended up bubbling up in the different cases, driven by the specifics of the domain.
Not a lawyer, just excited to see the outcomes :).
Democracy is the worst system we’ve tried, except for all the others.
(Also: The GPL can only be enforced because of laws passed by Congress in the late ‘70’s and early ‘80’s. And believe you me, people said all the same kinds of things about those clowns in Congress. Plus ça change…)
If the training is established as fair use, the underlying license doesn't really matter. The term you added would likely be void or deemed unenforceable if someone ever brought it to a court.
But this is all grey area… https://www.authorsalliance.org/2023/02/23/fair-use-week-202...
I can see how it pushes the boundary, but I can’t lay out logic that it’s not. The code has been published for the public to see. I’m always allowed to read it, remember it, tell my friends about it. Certainly, this is what the author hoped I would do. Otherwise, wouldn’t they have kept it to themselves?
These agents are just doing a more sophisticated, faster version of that same act.
I think this is the part where we disagree. Have you used LLMs, or is this based on something you read?
I don't remember the exact case now, but someone was cloning a program (Lotus 1-2-3 -> Quattro or Excel???). They printed every single screen and made a team write a full specification in English. Later, a separate team looked at the screenshots and text and reimplemented it. Apparently meatballs can get tainted, but the plain English text loophole was safe enough.
[1] From https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...
> Who can't contribute to Wine?
> Some people cannot contribute to Wine because of potential copyright violation. This would be anyone who has seen Microsoft Windows source code (stolen, under an NDA, disassembled, or otherwise). There are some exceptions for the source code of add-on components (ATL, MFC, msvcrt); see the next question.
This is close to how I would actually recommend reimplementing a legacy system (owned by the re-implementer) with AI SWE. Not to avoid copyright, but to get the AI to build up everything it needs to maintain the system over a long period of time. The separate team is just a new AI instance whose context doesn’t contain the legacy code (because that would pollute the new result). The analogy isn’t too apt though, since there is a difference between having something in your context (which you can control and is very targeted) and the code that the model was trained on (which all AI instances will share unless you use different models, and anyway, it isn’t supposed to be targeted).
Also, humans do not need to read millions of pirated books to learn to talk. And a human artist doesn't need to steal millions of pictures to learn to draw.
They... do? Not just pictures, but also real life data, which is a lot more data than an average modern ML system has. An average artist has probably seen- stolen millions of pictures from their social media feeds over their lifetime.
Also, claiming to be anti-capitalist while defending one of the most offensive types of private property there is. The whole point of anti-capitalism is being anti private property. And copyright is private property because it gives you power over others. You must be against copyright and be against the concept of "stealing pictures" if you are to be an anti-capitalist.
https://en.wikipedia.org/wiki/Cleanroom_software_engineering
You can do whatever you want with the software, BUT you must do a few things. For GPL it's keeping the license, distributing the source, etc. Why can't we have a different license with the same kind of restrictions, but also "Models trained on this licensed work must be open source"?
Edit: Plus the license would not be "GPL+restriction" but a new license altogether, which includes the requirements for models to be open.
I suggest a careful reading of the GNU GPL, or the definition of Free Software, where this is carefully explained.
"A work based on the program" can be defined to include AI models (just define it, it's your contract). "All of these conditions" can include conveying the AI model in an open source license.
I'm not restricting your ability to use the program/code to train an AI. I'm imposing conditions (the same as the GPL does for code) onto the AI model that is derivative of the licensed code.
Edit: I know it may not be the best section (the one after regarding non-source forms could be better) but in spirit, it's exactly the same imo as GPL forcing you to keep the GPL license on the work
Using AGPL as the base instead of GPL (where network access is distribution), any user of the software will have the rights to the source code of the AI model and weights.
My goal is not to impose more restrictions to the AI maker, but to guarantee rights to the user of software that was trained on my open source code.
"The freedom to run the program as you wish, for any purpose (freedom 0)."
You are still free to train on the licensed work, BUT you must meet the requirements (just like the GPL), which would include making the model open source/weight.
Like if I copy-paste GPL-licenced code, the way you realise that I copy-pasted it is because 1) you can see it and 2) the GPL-licenced code exists. But when code is LLM generated, it is "new". If I claim I wrote it, how would you oppose that?
[0] https://factually.co/fact-checks/justice/evidence-investigat...
My view is that copyright in general is a pretty abstract and artificial concept; thus corresponding regulation needs to justify itself by being useful, i.e. encouraging and rewarding content creation.
/sidenote: Copyright as-is barely holds up there; I would argue that nobody (not even old established companies) is significantly encouraged or incentivised by potential revenue more than 20 years in the future (much less current copyright durations). The system also leads to bad resource allocation, with almost all the rewards ending up at a small handful of most successful producers-- this effectively externalizes large portions of the cost of "raising" artists.
I view AI overlap through the same lens-- if current copyright rules would lead to undesirable outcomes (by making all AI training or use illegal/infeasible) then law/interpretation simply has to be changed.
It's all about whose outcomes are optimized.
Of course, the law generally favors consideration of the outcomes for the massive corporations donating hundreds of millions of dollars to legislature campaigns.
I think the redistribution effect (towards training material providers) from such a scenario would be marginal at best, especially long-term, and even that might be over-optimistic.
I also dislike that stance because it seems obviously inconsistent to me-- if humans are allowed to train on copyrighted material without their output being generally affected, why not machines?
A lot of it boils down to whether training an LLM is a breach of copyright of the training materials which is not specific to GPL or open source.
This is a big difference that has already bitten them.
Lobbying is for people trying to stop them; externalities are for the little people.
Once training is established as fair use, it doesn't really matter if the license is MIT, GPL, or a proprietary one.
https://en.wikipedia.org/wiki/Fair_use#/media/File:Fair_use_...
and it is certainly not part of the Berne Convention
in almost every country in the world even timeshifting using your VCR and ripping your own CDs is copyright infringement
(which is the linch-pin of the sloppers)
Is this legally settled?
With proprietary or, more importantly, single-owner code, it's far easier for this to end up in a settlement rather than being dragged out into an actual ruling, enforcement action, and establishment of precedent.
That's the key detail. It's not specific to GPL or open source, but if you want to see these orgs held to account and some precedent established, focusing on GPL and FOSS licensed code is the clearest path to that.
> A GPL license is a contract in most other countries. Just not US probably.
Not just the US. It may vary with the version of the GPL too. Wikipedia claims it's a civil law vs common law country difference - not sure the citation shows that though.
They could start selling a version of Word tomorrow that gives them the right to train from everything you type on your entire computer into any program. Or that requires you to relinquish your rights to your writing and to license it back from Microsoft, and to only be able to dispute this through arbitration. They could add a morals clause.
For those who are into freedom, I don't see how dictating how you use what you build in such a manner is in the spirit of free and open.
Just my opinion on it, to each their own on the matter.
It's easy as a developer to slip into a role where you want to build/package (maybe sell) some software product with minimal obligations. BSD-likes are obviously great there.
But the GPL follows a different perspective: It tries to make sure that every user of any software product is always capable of tinkering with it and changing it himself, and the more permissive licenses do not help there because they don't prevent (or even discourage!) companies from just selling you stripped and obfuscated binary blobs that put you fully at the vendor's mercy.
I'm of the opinion that what I build, I'm willing to share it and let others use it as they see fit even if it's not to my advantage.
My view is that every project and library where I can peruse the source is a gift/privilege. GPL restrictions I view as a small price to "pay it forward", and to keep that privilege for all wherever possible.
Are you complaining about proprietary software? I hear the restrictions are a lot tighter for Photoshop's source code, or iOS's, but for some reason you are one of the people who hate GPL as a hobby. Please don't show up whining about "spirits" when Amazon puts you out of business.
Corporations have always talked about the virality of GPL, sometimes but not always to the point of exaggeration, you'd think that after getting the proof of concept done the AI companies would be running away at full speed from setting a bomb like that in their goldmine.
Putting in tons of commonly read books and scientific papers is safer, they can just eventually cross-license with the massive conglomerates that own everything. But the GPL is by nature hostile, and has been openly and specifically hostile from the beginning. MIT and Apache, etc. you can just include a fistful of licenses to download, or even come up with architectures that track names to add for attribution-ware. But the GPL will obviously (and legitimately) claim to have relicensed the entire model and maybe all its output (unless they restricted it to LGPL.)
Wouldn't you just pull it out?
I submit the evidence suggests the genAI companies have none of those attributes.
But I'm not certain that the relevant players have the same consequence-fearing mindset that you do, and to be honest they're probably right. The theft is too great to calculate the consequences, and by the time it's settled, what are you gonna do - turn off Forster's machine?
I hope you're right in at least some cases!
Why would the GPL settle? Even more, who is authorized to settle for every author who used the GPL? If the courts decided in favor of the GPL, which I think would be likely just because of the age and pervasiveness of the GPL, they'd actually have to lobby Congress to write an exception to copyright rules for AI.
A large part of the infrastructure of the world is built on the GPL, and the people who wrote it were clearly motivated by the protection that they thought that the GPL would give to what was often a charitable act, or even an act that would allow companies to share code without having to compete with themselves. I can't imagine too many judges just going "nope."
If ultimately copyright holds up against the models*, the GPL will be a permanent holdout against any intellectual property-wide cross-licensing scheme. There's nobody to negotiate with other than the license itself, and it's not going to say anything it hasn't said before.
* It hasn't done well so far, but Obama didn't appoint any SCOTUS judges so maybe the public has a chance against the corporations there.
Haha no.
https://windsurf.com/blog/copilot-trains-on-gpl-codeium-does...
And just in the last two days, AI generating LGPL headers (which it could not do if identifying LGPL code was pulled from the codebase) and misattributing authors:
https://devclass.com/2025/11/27/ocaml-maintainers-reject-mas...
That first link shows people actively pulling out GPL code in 2023 and marketing around that fact, though. That's not great evidence that they're not doing it now, especially if testing whether GPL code is still in there seems to be as easy as prompting with an incomplete piece of it.
I'd think that companies could amass a collection of all known GPL code and test for it regularly in order to refine their methods for keeping it out.
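To make the "test for it regularly" idea concrete, here is a minimal sketch of one way such a check could look: measure how much of a model's output overlaps verbatim, token for token, with a known GPL snippet. Everything here is an assumption for illustration (the function names `tokens`, `shingles` and `overlap_ratio`, the crude tokenizer, the window size); it is not how any vendor actually filters output, and a high ratio only flags text for a human to look at, it says nothing about infringement.

```python
import re


def tokens(source: str) -> list[str]:
    # Crude tokenizer (an assumption, not a real lexer): identifiers,
    # integers, and single non-space characters. Whitespace is dropped,
    # so reformatting alone does not hide a copy.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def shingles(source: str, n: int = 8) -> set[tuple[str, ...]]:
    # Every run of n consecutive tokens ("shingle") in the text.
    toks = tokens(source)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def overlap_ratio(candidate: str, reference: str, n: int = 8) -> float:
    # Fraction of the candidate's shingles that also occur in the reference.
    cand = shingles(candidate, n)
    if not cand:
        return 0.0  # candidate shorter than n tokens: nothing to compare
    return len(cand & shingles(reference, n)) / len(cand)
```

In practice you would hash the shingles of the whole GPL corpus into an index once and check each output against it, but the comparison itself stays this simple. Note that renaming variables defeats it, which is part of why "keeping it out" is harder than it sounds.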
> (which it could not do if identifying LGPL code was pulled from the codebase)
Are you sure about this? Linking to LGPL code is fine afaik. And why not train on code that linked to universally available libraries that are legal to use? Seems like one might even prefer it.
Seems like this was rejected for size and slop reasons, not licensing. If the submitter of the PR isn't even fixing possibly hallucinated authors' names, it's obvious that they didn't really read it. Debugging vibe coded stuff is like finding an indeterminate number of needles in a haystack.
It's just a side cost of doing business, because asking for forgiveness is cheaper and faster than asking for permission.
{ "includeCoAuthoredBy": false }
The "enforceability" of the GPL was never in any doubt because it's not a contract and doesn't need to be "enforced". The license grants you freedoms you otherwise may not have under copyright. It doesn't deny you any freedoms you would otherwise have, and it cannot do so because it is not a contract. If the terms of the GPL don't apply to your use then all you have is the normal freedoms under copyright law, which may prohibit it. If so, any "enforcement" isn't enforcement of the GPL. It's enforcement of copyright, and there's certainly no doubt on the enforceability of that.
For the GPL to "fail" in court it would have to be found to effectively grant greater freedoms than it was designed to (or fewer, resulting in some use not being allowed when it should be, but that's not the sort of case being considered here). It doesn't, and it has repeatedly stood up in court as not granting more freedoms than were intended.
That case was important, but it's not about the virality. There have been no concluded court cases involving the virality portion causing the rest of the code to also be GPL'd, but there are plenty involving enforcement of GPL on the GPL code itself.
The distinction is important because the article is about the virality causing the whole LLM model to be GPL'd, not just about the GPL'd code itself.
I'd like to think it wouldn't be a problem to enforce, but I've also never seen a court ruling truly about the virality portion to back that up either - which is all GP is saying.
It's sad to see Microsoft's FUD still festering 20 years later.
The difference between a license and a contract may be too subtle for the denizens of HN to grasp in 2025 but I assure you it's not lost on the legal system. It's not lost on those of us who followed groklaw back in the day, either. Sad we have to live with an internet devoid of such joys now.
You're basically saying "the GPL doesn't go back in time and relicense unrelated code." But nobody was ever claiming it does, and describing it as "viral" doesn't imply that it does. It's "viral" because code that you stick to it has to conform to its rules. It's good that the GPL is viral. I want it to be viral, I don't want people to be able to hide GPL'd code in a proprietary structure.
What you're calling the "virality portion" says that one of the ways you *are* allowed to use the code is as part of other GPLed software. If you're going to look for court cases that explicitly "involve" that, it would have to be someone either:
* using it as a defense, i.e. saying "we're covered by the GPL because the software we embedded this code in is GPL" (That will probably never happen because people don't sue GPLed projects for containing GPLed code), or
* coming into line with the GPL by open sourcing their own code as part of resolving a case (The BusyBox case [2] was an example of that).
If you just want cases where companies that were distributing GPL code in closed source software were prevented from doing so, the Cisco [1] and BusyBox [2] cases were both notable examples. That they were settled doesn't somehow make them a weaker "test of the GPL" - rather the companies involved didn't even attempt to argue that what they were doing was permitted. They came into line and coughed up. If you really must insist on one where the defendant dug in and the court ended up awarding damages, I don't think there have been any in the US but there has been one in France [3].
As for "nobody was ever claiming it does", the "viral" wording has been used for as long as the GPL has been around as a scare tactic for introducing exactly that erroneous idea. Even in cases where people understand what the license says, it leads to subtle misunderstandings of the law, which is why the Free Software Foundation discourages its use. (Also, you literally said, in these exact words, "the virality causing the whole LLM model to be GPL'd".)
[1] https://en.wikipedia.org/wiki/Free_Software_Foundation,_Inc.....
[2] https://en.wikipedia.org/wiki/BusyBox#GPL_lawsuits
[3] https://www.dlapiper.com/en/insights/publications/2024/03/wa...
The Cisco case was about distributing GPL binaries, not linking it with the rest of the code base and the rest of that code base then needing to be GPL. It's a standard license enforcement unrelated to the unique requirements of GPL.
The BusyBox case is probably the closest in the list, but as you already point out we didn't get a ruling to set precedent and instead got a settlement. It seems obvious what the ruling would be (to me at least), but settlement was explicitly not what is being talked about.
Bringing in French courts, they issued fines - they didn't issue the type of order this article talks about which is about releasing the entire thing involved at the time with GPL.
This isn't related to fear, uncertainty, or doubt about GPL. It's related to what has/hasn't already been ruled in the court systems handling this license before as the article skips past a bit. Even in the case we assume the courts will rule with what seems obvious (to me at least), it has a tangible difference in how these cases will be run, the assumptions they will have, and how long they will last.
Conversely, to my knowledge there has been no court decision that indicates that the GPL is _not_ enforceable. I think you might want to be more familiar with the area before you decide if it's legally questionable or not.
You are then restricted by copyright just like with any other creation.
If I include the source code of Windows into my product, I can't simply choose to re-license it to say public domain and give it to someone else, the license that I have from Microsoft to allow me to use their code won't let me - it provides restrictions. It's just as "viral" as the GPL.
Also, "don't use my code" is not viral. If you break the MSFT license, you pay them, which is a very well-tested path in courts. The idea of forced public disclosure does not seem to be.
If the GPL license didn't exist, and instead you were just relying on copyright, then that's an injunction. You have to stop using the code you "stole" and pay reparations.
In UK law, if you distribute copyright material in the course of a business you can be facing 10 years in prison and an unlimited fine.
Sure you can't get them to agree to the GPL, they could simply stop distributing and then turn up to their stint in prison and massive fine. In reality I suspect they would take the easy way out and comply with the license.
Corporations can't go to prison.
The other factor of copyright, which is relevant, is how material is obtained. If the material is publicly accessible without protection, you have no reasonable expectation to exclusive control over its use. If you don't want AI training to be done on your work, you need to put access to it behind explicit authentication with a legally-binding user agreement prohibiting that use-case. Do note that this would lose your project's status as open-source.
Well, the difference is that copyright law applies to work fixed in a tangible medium of expression. This covers e.g. model weights on a hard drive but not the human brain. If the model is able to reproduce others’ work verbatim (like the example the article brings up of the song lyrics) then under copyright law that’s unauthorized reproduction. It doesn’t matter that the data is expressed via probabilistic weights because, due to past lobbying/lawsuits by the software industry to get compiled binary code covered by copyright, reproduction can include copies that aren’t directly human readable.
> If the material is publicly accessible without protection, you have no reasonable expectation to exclusive control over its use.
There’s over 20 years of successful GPL infringement lawsuits over unlicensed use of publicly available GPL code that disagrees with this point.
we also treat public goods found over the internet however we want, as if the World Intellectual Property Organization Copyright Treaty and Berne Convention for the Protection of Literary and Artistic Works aren't real, or because we can, as we are operating in international waters, selling products to other sailors living exclusively in international waters /s