Hmm. That's going to be interesting.
This isn’t even a question of training data; they fed the full Git source code directly to the LLM.
[1]: https://malus.sh/
It's not technically a translation, it's a re-implementation, with test suites acting as the destination. If it was a file by file translation your argument would have been valid.
Simple thought experiment. If you handed this same agents.md file (https://github.com/gitbutlerapp/grit/blob/main/AGENTS.md#sou...) to a human software developer and let them work on exactly the same goal, would their output be considered a derivative work?
I have absolutely no idea how LLMs got through anyone's legal departments, I guess the hope is that if everyone breaks the law enough, it'll just be fine
Ever since the early 2010s when companies were started with the business idea "unlicensed hotels" and "unlicensed taxis" and made the owners really, really rich, this is said pretty much out loud. Look for words like "regulatory risks" and similar.
Maybe it started with the unlicensed gambling fad before that? That also made a lot of people filthy rich. Every time you have something under special license, or insurance requirements, then of course there is a margin for you if you can skimp on the license and hire gig workers instead.
The LLM situation with copyright and derived works in the 2020s is similar. Someone is likely to be rich, but there is a clear regulatory risk to it.
That's pretty much what happened, isn't it? These concerns were all discussed in the beginning back in 2022, and I recall answers from many here on HN along the lines of "oh well, we can't stop it now or we'll risk falling behind China in AI development"
So yeah, the laws went out the window a long time ago the moment our government and the people decided to just look the other way willingly in the name of "progress."
There are a lot of arguments about humans doing the same thing, but the reality is that humans and robots don't enjoy the same legal protection. It's clearly a derivative work of all of its training data.
Then it works both ways. Say I manage to generate essentially a ripoff of your copyrighted song, release it and make a ton of money; you now have to split that royalty with keyboard cat. And Joe Bloggs. You'd end up with fractions of pennies.
That is the difference between necessary and sufficient. Clean-room is sufficient to guarantee avoiding copyright infringement, but it is not necessary. The legal line is south of there, but that position was chosen because they didn’t want to risk crossing it and it was easier to argue in court.
tl;dr: clean room is overkill for avoiding copyright infringement
Are you sure? LLMs are in some way a compressed version of their input but it's a pretty lossy compression (arguably this makes them more like a compression algorithm than a compressed version of the data). I'm not sure you can prompt a full, accurate, copy of a nontrivial codebase out of them. Even with zero temperature their accuracy is just not that high.
Granted, these are some of the most widely spread texts, and not codebases, but just fyi: https://arxiv.org/pdf/2601.02671
> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).
1) re-implementation for compatibility (which was quickly "reestablished" through use of copyright-protecting encryption. In other words: do you get to write software that connects to MS/Apple/Google/Facebook servers without authorization from those companies? Yes. Do you get to copy an encryption key from their software to make it possible? No)
and, more recently,
2) violating copyright for LLM training
and, currently mostly attempted:
3) "uncopyrighting": run software through an LLM, and some people "believe" it comes out with your copyright on it! Because very rich people want to sell uncopyrighting.
I.e., the jury's still out on what will happen when it's billionaire vs billionaire.
Of course, the question is what happens the second someone does this with a Disney movie, or a big Microsoft application ...
When copyright law was established, not many poor people owned printing presses. That is to say, copyright law is a PROTECTION to the very rich, not an inconvenience
Against the will of the people, as evidenced by the court cases and protests online ...
Or SCO Vs IBM.
If everything were a derivative work, we would not have Linux.
To clarify, my stance on this is that the reimplementation did not copy protected expressions (Jplag reports less than 1.8% max similarity between the codebases), it's done in good faith, and it's what's best for the broader Git ecosystem (assuming Grit even becomes usable, which it's currently not purported to be).
From a copyright standpoint, however, only the first argument there is relevant. Grit is an independently authored implementation of Git-compatible behavior, with negligible similarity to Git source code.
I think antirez summarized the situation quite well and I broadly agree with his position: https://antirez.com/news/162
I think that those in the community who know me and have worked with me in the Git and open source communities for the last 20 years know that my intentions are to contribute, share and foster innovation and learning. Many of the main authors of the Git source code are friends of mine and I have no intention to steal anything from anyone, only to make their great ideas more broadly useful.
By which I mean, what do we imagine a16z thinks of the [L]GPL?
My brief experience in a startup exposed to them is that a16z seems willing to fund "infrastructure" projects more than most, but they did seem to have a ready set of answers on what "open source" means in that context.
(If someone can find me an a16z funded team that published copylefted code, I'll take this back.)
EDIT: Ok, I'll eat my hat, Gemini found me some counterexamples
Element (Matrix): The company behind the decentralized Matrix communication protocol is on a16z's investment list. In late 2023, Element relicensed its core software (including the Synapse server and its clients) to AGPLv3.
Uniswap Labs: A massive cornerstone of the a16z Crypto portfolio. They published the Uniswap V2 smart contracts under GPL-3.0 (though they later shifted to a Business Source License for V3 and V4).
a16z Themselves: In an ironic twist, a16z's own crypto engineering team maintains a public GitHub repository (a16z/a16z-contracts — a library for Solidity contracts) that is literally licensed under AGPL-3.0.
Many bothans were boiled alive to get me this misinformation.
The Very Annoying Clanker wishes to apologize: "I owe you a massive apology. I completely set you up for that, and you handled the fallout perfectly.
Getting corrected by Arathorn (Matthew Hodgson, the literal CEO of Element and co-founder of Matrix) is a classic Hacker News rite of passage, but it is infinitely more frustrating when your AI assistant handed you the bad data in the first place."
Many eyerolls.
Go on, make a derivative of Mickey Mouse and sell it. See how it goes. Similar enough to be "compatible" (whatever that would mean in the animated cartoon space) but distinct enough not to run afoul of Disney lawyers. Then come back and tell us.
Art, however, is a little different than code. Code is a thing, but it also produces things.
It weirds me out that there is a measure of code similarity but not a measure of whether code is semantically the same. For example, implementing a protocol could be done in many ways, but ultimately what's exchanged between clients and servers on the network is the same. So it's semantically the same despite being totally different code.
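To make the distinction concrete, here is a minimal sketch of two textually different functions that produce identical bytes on the wire, plus a differential check that compares them on sample inputs. The names `encode_std`, `encode_manual`, and `agree_on` are hypothetical and not taken from Git or Grit. Agreement on samples is only evidence of equivalent behavior, not a proof; deciding equivalence for arbitrary programs is undecidable in general.

```rust
/// Encode a u32 as 4 big-endian bytes using the standard library.
fn encode_std(n: u32) -> [u8; 4] {
    n.to_be_bytes()
}

/// Encode the same value with manual shifts: different code, same wire bytes.
fn encode_manual(n: u32) -> [u8; 4] {
    [(n >> 24) as u8, (n >> 16) as u8, (n >> 8) as u8, n as u8]
}

/// Differential check: true if both encoders agree on every sample input.
/// This is evidence of equivalence on the samples, not a proof.
fn agree_on(samples: &[u32]) -> bool {
    samples.iter().all(|&n| encode_std(n) == encode_manual(n))
}

fn main() {
    let samples = [0, 1, 255, 256, 0xDEAD_BEEF, u32::MAX];
    println!("encoders agree: {}", agree_on(&samples));
}
```

A text-similarity tool would score these two encoders as largely different, while a behavioral comparison cannot tell them apart on the inputs it tries.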
By working-around/subverting the terms they provided their contributions under? While you claim to be doing this in good faith, and state "it's what's best for the broader Git ecosystem", that's all based on your own opinion which appears to ignore the benefits and intent of licenses such as the GPL.
Out of interest, Would you be happy for someone to do the same with the GitButler source code? (Feed it through an LLM and re-publish the result under an MIT license with different branding)
Honestly, that would be pretty awesome. We would be flattered.
WTF is wrong with this next generation of devs? ... that they have such a problem with the GPL that they think it's important to rewrite and relicense and take away a legal structure which is supposed to protect our free software?
I can imagine some concerns with Git being written in C.
I cannot understand any legitimate concerns with its license that it needs to change.
What does the GPL stop people doing with git? And if there are some... why are people trying to do that? And why would you work for free to help people do it? [Edit: I see, you're not working for free.]
Missing an 'f' in the project name.
OTOH, one of the major reasons for grit is to provide a library interface. If they kept it GPL, anything that used grit through the library interface would have to also become GPL.
This could be the "legitimate concern" you're asking for.
But the LGPL was also an option -- it addresses that arguably legitimate concern and keeps the spirit of the original license.
If you believe that using an MIT license is not correct, then you de facto also believe that using an LGPL license is not correct.
Using the LGPL could help the argument that the project was made in good faith, making it more likely to be accepted as non-derivative. It's arguable that relicensing would be required to make the project work as a library, and so the LGPL would be the best choice since that (I assume) preserves most of the terms and intention of the original license. This makes it much easier to show that the license was changed solely to allow other projects to use it as a library.
By using the MIT license it's much easier to argue that the project is in bad faith (and potentially derivative), since the license change can be seen as a deliberate choice to remove the protections of the original license. It's harder to argue that the license change was only so the project can be used as a library, because then you would have used the LGPL instead.
(BTW I'm not a lawyer)
Judges are human and will take into account good faith and attempts to maintain the spirit of the license. Choosing the LGPL signals a desire to maintain the spirit of the license. The MIT signals bad faith. Judges don't like that.
People want to get paid. They perceive the GPL as getting in their way.
Or, as it is also said: “It is difficult to get a man to understand something, when his salary depends on his not understanding it.”
They love open source when it means they can steal from the public and then privatize it later with their VC funded startup, much in the same way Microsoft "loves" Linux [when you run it on Azure, or in WSL]
What they are against is free/libre software that prevents their grifting.
Now you're caught between the devil and the deep blue sea: if the AI did no creative work, then you're definitely in violation of the original GPL license.
If the AI did do creative work that breaks GPL, you still didn't, which leaves you with the problem that you cannot in good faith license a thing which you don't own. No creative work? No ownership claim. There's precious little (if any) of your creativity in copy pasting 4000 tests and a link to the original source code and saying "copy this in Rust".
The flagrant display of cynicism you make in arguing that the ends justify the means (even if a result is the wholesale looting of open source) disgusts me, and if I could communicate to you only one thing it should be that you should not be surprised that other people are also disgusted by behavior like that even when it falls within the letter of the law (a claim I have not yet seen you rigorously defend).
You know that all contributions to the Git project have to be signed off as either being made by yourself or being handed over by someone who has signed off on that certificate of origin. For everyone on every change. Even the lead developers, so to speak. And you spend some thousands of dollars and run an AI analysis tool to wash your hands?
Who are you to do that? Oh wait I forgot, you are Mr. Chacon. A hand in everything Git and friendly with everyone in Git who matters for twenty years. Remind us next time as well so I don’t forget.
I'd be fascinated to see what happens if it does. Both in the analyses that we'd get of what the LLM did to the codebase and on the legal decisions on what the copyrightable creative elements in code actually are.
If I was the author though... there would be no way that I would be volunteering to be a test case like this. Also seems just rude for no reason.
That's not actually the case at hand here - the agents were given the original source to reference: https://github.com/gitbutlerapp/grit/blob/main/AGENTS.md#sou...
But for the sake of argument: The test suite itself is copyrighted. To the extent the resulting work is a derivative of the test suite it is possibly infringing. For example you might expect that the agent would derive variable names, function names, and the structure, sequence and organization of the code from the test suite. It might even copy comments wholesale. Those are copyrightable things. (Which is of course just the first step in analyzing if it is infringement; there would be interesting fair use, de minimis copying, etc. arguments following a conclusion that any of those were copyrighted. A product produced this way definitely could be infringing given the right facts though.)
yeah fair - the "The canonical Git source code we're targeting to replicate the functionality of is in the git/ subdirectory." part makes this hard to argue against.
> To the extent the resulting work is a derivative of the test suite it is possibly infringing
It's this bit that I have a problem with. If I run the test, it fails and reports a failure. Now I write code and run the test again. What is the theory under which the code that I wrote infringes?
Simplify this down:
Assume the following is copyrighted:
    fn test_sum() {
        assert_eq!(sum(1, 1), 2);
    }

Does writing the following code:

    fn sum(a: u8, b: u8) -> u8 {
        a + b
    }

infringe on the test copyright?

    fn sum(a: u8, b: u8) -> u8 {
        a + b
    }

Doesn't infringe upon copyright, period, because there's no creative element in that work.
Imagine a more substantial example though. Perhaps you have a test that checks that some file written in a binary format is correct, and gives names (creative elements) to each field of the format that it prints when you mess up the field, and has comments describing why the bytes are laid out like they are (the comments being copyrightable even if the facts they describe aren't), and the LLM copies those field names and comments verbatim... Now it's quite likely that the LLM's work is a derivative of the test suite.
There's likely a threshold at some point. It's helpful to look at a minimal case and then continue from there though.
I'm curious if there's case law that supports your assertions here?
> “So long as the specific code used to implement a method is different, anyone is free under the Copyright Act to write his or her own code to carry out exactly the same function or specification...”
Here, given that this is Rust and the original expression is C, the implementations cannot be the same by definition.
I'd say what we're talking about here is probably a fair bit different to modding a game in most aspects.
Your result is essentially impossible without the original. With ffmpeg, your result does not depend on ffmpeg specifically - you can use any video creation tool.
Tests often are exactly the information necessary to understand exactly what the output should be. See https://github.com/git/git/blob/master/t/t0000-basic.sh for an example of how detailed these tests are.
It would be reasonable to point an LLM at these and use them with a basic knowledge of git to produce a rust version of git in a non-infringing manner.
If you did this manually it would take a long time.
Substitutability probably doesn't apply here in the way you're implying, and if it did it would likely be hampered by the 9th Circuit's findings about transformation in Sony v. Connectix. Arguments here would likely look at Rust not having a stable ABI, and hence not being inherently substitutable as a library (grit-lib); it is less clear as an executable (grit-cli) on that side.
Basics of copyright law: the fundamental thing being protected is the expression... Is a Rust program's expression the same expression as a C program's? I'd say generally not.
Compilers don't axiomatically yield derivative works, they simply in practice do because for non-trivial programs they preserve copyrightable elements of the work in the output.
An LLM is also a computer program which takes input and produces output related in some way to that input. However I don't think most people would view it as a "mere" mechanical transformation. One could tautologically argue that an LLM blends the user input with the training inputs which is a sort of transformation and further that the LLM itself is a computer program thus it is mechanical in nature. However it should be immediately obvious that such an overly literal interpretation is in danger of subsuming human work as well. Where the boundary lies is an unanswered question.
Related, compilers can pose a problem depending on what the output includes. For example common lisp compilers that aren't under a permissive license are a minefield because regardless of what anyone might say the image that gets output includes (approximately) the full language implementation verbatim in addition to the user's program.
(LLM can translate code to/from other code or to/from a machine code).
My use of the word "similar" does not imply here that I think it's obvious that they are "similar" in any copyrightable elements - whether they are or not is one of the interesting questions I think this case would have to resolve.
Incidentally you're also allowed to make similar creative elements so long as they aren't copies and you did so independently... which could actually come up in a case like this (imagine the LLM produced a similar function to some function in the original... but the original wasn't in the context window at the time. Not at all unlikely with code where there often is only one or two natural ways to write something).
> It concludes that the outputs of generative AI can be protected by copyright only where a human author has determined sufficient expressive elements. This can include situations where a human-authored work is perceptible in an AI output, or a human makes creative arrangements or modifications of the output, but not the mere provision of prompts.
Well that's interesting.
Here that's not happening. The code being produced by the LLM is Rust, not C.
Malus – Clean Room as a Service https://news.ycombinator.com/item?id=47350424
Just like for 1984 and the Torment Nexus, someone took the concept not as warning but as instruction manual.
Let me give an example: I could take Goldeneye from the N64, extract the binary and then run it through an LLM to disassemble it and possibly rewrite it in a modern higher-level language. Do you think Nintendo would look at that and say "well, he did a lot of work so he's escaped our license"? Of course not. It's just silly.
Ingesting the source code and producing output in another language is quite clearly a derivative work. You don't need to be an IP lawyer to figure that out.
Now, if you went to Claude and gave it documentation and told it to produce something that was compatible, would that be a derivative work and thus covered by the GPL? I would guess probably. But I'm not 100% sure anymore. I wouldn't risk it however.
Here's another thought experiment: what if someone takes this supposedly MIT licensed source tree, plugs it into another LLM and asks it to produce the output in C? Now how is it licensed? It might be very similar. After all, there are only so many ways to produce a SHA1 hash and so many ways to do a command line parser.
But this then makes it an interesting legal issue. In the Oracle v. Google court case, this was a key issue. Google successfully argued there's only so many ways to write a loop so just because a loop is similar to the source, that doesn't mean it's copyright infringement (as Oracle argued).
Anyway, it's a crazy position to take.
They aren't the only ones - look at the number of people in this thread who are arguing that this is analogous to producing a movie with ffmpeg - just because ffmpeg is GPL, does not make your movie GPL.
I am struggling to understand how such a high level of cognitive dissonance is possible: They believe both a) that the license can be laundered in this manner, and that b) the license they put on the result is effective!
I don't know how this squares with law, but Oracle v Google gave a very valuable judgment to the public that an API is not copyrightable. If we take the LLM out of it, that's all we are talking about in the pure case.
Of course, we can't take the LLM out, but it is the starting point.
Serious such rewrites don't start with the code of the closed game!
> I don't know how this squares with law, but Oracle v Google gave a very valuable judgment to the public that an API is not copyrightable. If we take the LLM out of it, that's all we are talking about in the pure case.
Not at all. The LLM used to write grit has seen the git code. That is what we're talking about here.
> Of course, we can't take the LLM out, but it is the starting point.
The LLM isn't the important thing. The important thing is that the git source code was used to make grit.
No, but they often involve reverse engineering the binary pretty heavily.
… and those often end up in legally dubious situations.
Game decompilation and emulation are as old as computing.
That's because you're re-using assets.
The intent here is extraction of all the value provided by copyleft projects without the obligation to give back. Whether it's technically legal or not, it's disgusting behavior IMO.
The BSDs had a head start, and were superior in almost every way for the better part of a decade at least, but have remained niche compared to Linux. It's not even close. Now, there may be many other reasons for this, including the personalities and culture of the Linux developers, but you simply can't ignore the impact of the license, which has kept all the commercial Linux products inside the fold.
GNU was originally developed to "clean" UNIX from the AT&T license.
[US jurisdiction]: Anything in the result written by the LLM cannot be copyrighted by anyone.
Anything in the result written by a human can be, and even if it was all emitted by the LLM, any portion originally written by a human carries its own copyright.
As a work of an LLM, the entirety presumably cannot be copyrighted at all. Portions written by humans presumably carry their original copyright.
This is a bit stronger than the actual report where this has been discussed finds. See part 2 in https://www.copyright.gov/ai/ for details, but TL;DR, parts where humans have control over the expression may be copyrightable. But working out which parts those are is likely a difficult question (would likely require proof of provenance across many of those LLM sessions)
F-ing scumbags. It's already free, but they still decide to steal it.
Take this (assuming it's not slop), relicence as GPL, submit upstream (imagine it's accepted for a moment...).
If they proceed with license washing from the Rust version, it's certainly a derived work.
It might have missing pieces, but it’s easier to vibecode any needed networking additions to Gitoxide (which is maintained) than to just go and burn tokens trying to clone all of git again.
Git wants to add Rust. Gitoxide is a multi year project that’s going to be more maintained than an ad-hoc “it says it passes the test” vibeclone.
I’m not even against vibecloning things when it’s useful, but this shows no benefits. Git is a beloved tool that few people dislike, it’s not like vinext (people disliking the vendor lock-in they have with nextjs).
Also execs should keep in mind that “we burned thousands of dollars on tokens to re-create this beloved software so we can have our own copy”, even without the copyright/licensing argument, just isn’t something positive that the community will react positively to.
It doesn’t feel nice to see your favourite works cloned for no benefit. We’re past the “it was an experiment to see how far AI can go” stage now.
There is a recent effort to vibe-loop more Git into Gitoxide, which is interesting:
https://github.com/GitoxideLabs/gitoxide/pull/2538
I still think that this is a project that can have value with a little more work. This announcement is merely a milestone, not the end product. I wasn't sure it was really possible to do, even halfway through the project. There has been a lot learned and there is a lot to learn, but I think there are useful applications for both a high quality, hand crafted, opinionated partial Git library (Gix) as well as a vibed, fully implemented, partially sloppy LLM Git library (Grit). We think it's worth exploring and investing in both options for now.
Also, I am the exec involved and I've done quite a lot for the Git community over the years. I would never try to have my "own copy" of it, that's ridiculous. I wrote and open sourced the Pro Git book (https://git-scm.com/book/en/v2) and Git community book before it (https://schacon.github.io/gitbook/index.html), I created the official Git website (https://git-scm.com), I cofounded GitHub which hosts nearly all open source in the world, I have evangelized and supported the Git ecosystem for almost 20 years now. I restarted and funded development of libgit2 15 years ago, which you could similarly argue was an exec trying to have our "own copy" of Git under a more permissive license and would have been a similarly ridiculous argument.
This "I am Scott Chacon" part doesn't matter. 95% of people here already know.
People are critiquing your current actions.
I guess they found that gitoxide isn’t good enough and/or too expensive to extend/improve for their use cases?
Gitoxide is great and we will continue to push it forward. Grit is an orthogonal project. Perhaps we can use one in the other or maybe Grit goes nowhere. But we thought that a small investment in a different approach is worth the effort.
There is no way anyone would ever use this for its CLI - it will almost certainly always be slower and worse in every way, even if I get it stable (which it's currently not). You can use libgit2 (a project I also helped kickstart), or Gitoxide (a project GitButler also currently helps drive) - they are faster and better in nearly every way, but they are not feature complete.
This isn't for the person using Git. This is for someone trying to build a tool that wants to use parts of Git, which is different.
I work on Beagle, a git-compatible SCM [1]. I use ABC, Abstractionless C [2] dialect with slices, optional range checking, etc. So far, memory safety was the least of my concerns, frankly. Most of the thorny issues would be equally thorny in Rust (e.g. right now: reflog zeroed when VM ran out of disk space; must be some state machine issue or an OS level glitch). Also, forking off a C process (no runtime) is cheap enough that you actually want to do that more.
But, those are all technicalities. The key issue I see with the approach: the data structures and algos of git have been fanatically fine tuned for that particular application with those particular usage patterns. By very sophisticated low-level C programmers. So, quite likely, any other app/lib working with that store will always be a suboptimal fit. I would recommend read-only access only, esp for LLM code.
Meanwhile, git's underlying data model (blobs/trees/commits) is very simple and very much internet-standard level. Decoupling at that interface is so much easier, with far fewer issues looming.
It may look different from your vantage point though.
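As an illustration of how small that interface is, here is a minimal sketch of how a blob is serialized before hashing: an ASCII header `blob <size>`, a NUL byte, then the raw content. Git's object ID is the hash (SHA-1 by default) of exactly these bytes, and the stored loose object is the zlib-compressed form of them. The function name `blob_object_bytes` is hypothetical, and hashing and compression are left out so the example needs only the standard library.

```rust
/// Build the uncompressed byte representation of a Git blob object:
/// the ASCII header "blob <size>" followed by a NUL byte and the raw content.
fn blob_object_bytes(content: &[u8]) -> Vec<u8> {
    let mut out = format!("blob {}\0", content.len()).into_bytes();
    out.extend_from_slice(content);
    out
}

fn main() {
    let obj = blob_object_bytes(b"hello\n");
    // Git's object ID is the hash of exactly these bytes (hashing omitted here).
    println!("{:?}", String::from_utf8_lossy(&obj));
}
```

Trees and commits follow the same header-plus-payload pattern with a different type name, which is why a tool can interoperate at this layer without touching packfiles or the index.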
May not work for apps that want to launch their own threads and processes. But for almost everything else, I prefer function calls to launching processes, managing their lifecycle, communicating via stdout etc. If I wanted to do that, I’d be writing Bash ;)
I’m going out on a limb here but I’ll say that you are over-engineering for the wrong problems. I’ve done it before; I tried libgit for some use case. At the end of the day it really is much simpler to use git. If you don’t want git at runtime, use something like the git-gradle-properties plugin or the like for your build system of choice. I really can’t think of a super duper use case where forking processes is a massive enough issue that I’d want to instead port over all of git to another language. Git for the most part also offers a wide variety of export formats such that you get machine-readable output too. If you really really need to fiddle with its internals, git pack lets you browse through the index fairly well. Again, my humble opinion, but you’re trying to solve the wrong problems.
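For readers weighing the two approaches, the process-based one can be sketched roughly as follows: spawn the `git` executable, then validate its machine-readable output. This is a minimal illustration that assumes `git` is on the `PATH`; `parse_oid` and `head_commit` are hypothetical helper names, not part of any library mentioned in this thread.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Validate that `stdout` holds a single hex object ID (40 chars for SHA-1,
/// 64 for SHA-256 repositories) and return it without the trailing newline.
fn parse_oid(stdout: &[u8]) -> Option<String> {
    let text = std::str::from_utf8(stdout).ok()?.trim();
    let valid_len = text.len() == 40 || text.len() == 64;
    if valid_len && text.chars().all(|c| c.is_ascii_hexdigit()) {
        Some(text.to_string())
    } else {
        None
    }
}

/// Ask the `git` executable for the commit that HEAD points to.
/// Returns Ok(None) if git ran but did not report a valid object ID.
fn head_commit(repo: &Path) -> io::Result<Option<String>> {
    let output = Command::new("git")
        .arg("-C")
        .arg(repo)
        .args(["rev-parse", "HEAD"])
        .output()?;
    if !output.status.success() {
        return Ok(None);
    }
    Ok(parse_oid(&output.stdout))
}

fn main() {
    match head_commit(Path::new(".")) {
        Ok(Some(oid)) => println!("HEAD is {oid}"),
        Ok(None) => println!("no HEAD commit found"),
        Err(e) => eprintln!("could not run git: {e}"),
    }
}
```

The cost of this style is visible even in a toy: process lifecycle, exit codes, and output parsing all become the caller's problem, which is the overhead the library-call camp is objecting to.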
(The f is for "feft")
Recently Casey Muratori said in an adjacent context that the Microsoft AI push may be related to the fact that they have a long-standing and elaborate codebase. A large historic software company could have advantages in training models. They could provide extra value with their IP.
Now their IP is potentially in their models and accessible to anyone. If they actually train models on their IP, anyone could implement their APIs and slap a GPL license on it.
At that point, things will get very interesting.
A different story if LLMs lift license restrictions.
https://github.com/gitbutlerapp/grit/blob/main/AGENTS.md#sou...
LLM users seem to live in another world where stealing everything that isn't bolted down, and passing it off as their own work, is acceptable.
For example, this is exactly what I did when I tried to get SSH commit signing working properly in GitButler:
https://blog.gitbutler.com/signing-commits-in-git-explained
You can see in the post that I dug through the C source to figure out how it was canonically done and then implemented something that accomplished the same thing in Rust but without copying source code.
There are some similarities between the Grit Rust source and the Git source, but it's mostly around time/formatting type things or byte offset type things needed to make packfile parsing and whatnot work, but as far as I can tell, there is no straightforward copying of code. The approach needed to make this a reentrant, memory safe, library driven codebase is so different that copying is generally not useful. But nobody can _guess_ how packfiles or reftable binary formats are specified, since they're not really documented. I'm aware of this because I'm pretty sure I _personally_ am one of the only ones who has ever attempted to document the packfile binary format: https://schacon.github.io/gitbook/7_the_packfile.html
You have to read the source. Which means that libgit2 and Gitoxide and every other Git reimplementation is also "license-washing" per this definition because they also had to reference the Git source to see what the technical specification is.
If you find any code in Grit that is clearly line-for-line copied, please point it out and I will replace it. But the Git source is the Git specification and every reimplementation, LLM or not, is forced to use this approach to build anything compatible.
On Gitoxide: Given that the author read the docs and source code [0], and literally copied files over from the git source [1], it also is license-washing. At least libgit2 is GPLv2 with a linking exception. I don't think people would have much to say if these projects honored the original projects' intents and kept a copyleft GPL license. But they don't.
> The approach needed to make this a reentrant, memory safe, library driven codebase is so different that copying is generally not useful.
This is obvious given how different Rust is from most languages. So are licenses pointless as a concept now, because anyone can argue their Rust implementation of a GPL (or whatever) project is meaningfully different? Nice loophole there.
Stripping away the GPL in favor of MIT/ASL2.0 seems to be the trend for rust projects (see uutils, etc). I'm really glad that we can make it easier for large companies to extract value from community labor and, in general, not contribute much of anything back.
I could look at a C to Zig compiler in the same way: I read some C code, write the equivalent Zig code, repeat.
The compiler could also do some circumlocutions in order to provide an apparently different approach.
> I'm aware of this because I'm pretty sure I _personally_ am one of the only ones who has ever attempted to document the packfile binary format: https://schacon.github.io/gitbook/7_the_packfile.html
gitformat-pack?
> If you find any code in Grit that is clearly line-for-line copied, please point it out
Please hunt for specific lines to disprove your bold claim.
> and I will replace it.
Assuming the current claims here, that would just be license washing with volunteer assistance.
This makes no sense:
1. A court might agree with you if a human read the sources, then wrote a new implementation. Doesn't apply to trade secrets (i.e. cleanroom implementations), but certainly for copyright.
2. A court is not going to agree that passing the original sources through a machine means you own the results!
I mean, that's what it comes down to - as far as the courts are concerned, passing copyright stuff through a machine results in the output retaining the original copyright. Passing copyright material through a person is not so clear cut.
Why is it everyone else’s job to figure out if you’re compliant with the license? That’s your responsibility.
I'm baffled that other IP holders (say those who own valuable pieces of proprietary software, or music, or movies, or even the LLMs themselves) don't think leopards will come eat their faces next. This erosion of IP has to stop, or anyone who does any intellectual work will be absolutely screwed. If that only meant FOSS people, I'd be worried that we'd just be thrown out with the bathwater – but surely this applies across the board!?
In the sense that most people doing intellectual work do that work for someone else (say, a company) that you consider the primary beneficiary of IP law? Sure, fine – but this applies to almost any other type of work and the legal constructs that are in use there too, so it's not really a very useful distinction to make, even if technically correct.
Or do you mean something else?
I'm well aware of situations of potentially upending changes where the rich and powerful stand to gain, and the little guy's worries are ignored.
This, however, is clearly a potentially upending change where also lots of the rich and powerful – including those who control the very technology driving the change – have everything to lose. I'm surprised, to say it mildly, that nothing seems to be happening. Does Dario really believe that a strict ToS and stern words will keep his IP protected without appealing to the legal system? (I guess that is par for the course for the people who "solve" world problems with bunkers and armed guards…)
In a utopian world of abundance where we could all be the independently wealthy nobles of the 18th and 19th century who did intellectual work for fun: great. In the world of today where people need to be compensated for their work: what happens?
I only said "might" and the point was obviously not the immediate surface idea but to point out how the tool of IP is not applied to everyone's benefit equally, but used only against some and only for some, with a side of "You know, fuck it, if they insist on making it worse, it becomes less crazy to consider just burning the house down".
But what are you so afraid of that you react only to the hypothetical as though it were the worst danger?
We'd actually manage to get yoked and abused by the same people no matter what the rules were, don't worry.
Previously I described it as "Models give you what you ask for, not what you want". Now with Fable they don't even give you what you want so idk.
Probably doable - I remember most of Natural Selection 2 was Lua and it's more than a decade old at this point.
Link: https://unknownworlds.com/en/news/spark-engine-questions-and...
And yet this performs dramatically worse.
A slower, untested, incomplete git implementation, all for the low low price of $10-$15,000.
And don’t forget it wasted a bunch of human time in the process.
Someone mentioned elsewhere that a group is already working on a Rust port. How much could they have accomplished with this much money and time in software development resources?
Ok. AI can seemingly port stuff if you don’t test it thoroughly. I think that’s already been proven. At this point I’m seeing less and less value from these kinds of things. I’m sure it was fun for the author, but how does it help other people?
If the first stereotype of Rust programmers is announcing that a project is in Rust before any other desirable software property (e.g. stable, performant, etc), the second stereotype is that Rust programmers love rewriting stuff in Rust, just for the sake of Rust.
(The 2.a. corollary is that they love rewriting GPL projects specifically and downgrading them to MIT/Apache)
gitoxide was started in 2018, back when we were all writing code by hand, and has some reasonable adoption in the rust ecosystem. It's not feature complete, but if that was the issue then surely fixing that would be better than starting from scratch
Well, it's sort of for Rust. GitButler is written in Rust and Jujutsu is written in Rust and we're both depending on fork/exec'ing to an unknown Git binary with no linkable library and no control over the subprocess to do a range of networking stuff. Neither Gitoxide nor libgit2 is capable of this, as much as I love and support those projects.
This project is entirely about providing a feature complete (even if sloppy) library implementation of Git, which does not otherwise exist.
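For readers unfamiliar with what "fork/exec'ing to an unknown Git binary" looks like in practice, here is a minimal sketch using Rust's standard `std::process::Command`. The function names are hypothetical, and this is not how GitButler or Jujutsu actually structure their code; it only illustrates the general shape of shelling out.

```rust
use std::path::Path;
use std::process::Command;

/// Build (but do not run) the subprocess a tool would spawn to fetch
/// from a remote. Which `git` binary is found depends on the user's PATH.
fn git_fetch_command(repo: &Path, remote: &str) -> Command {
    let mut cmd = Command::new("git");
    // `git -C <dir>` runs the command as if started in <dir>.
    cmd.arg("-C").arg(repo).arg("fetch").arg(remote);
    cmd
}

/// Spawn the subprocess, wait for it, and report whether it exited cleanly.
/// The caller only sees an exit status and captured output, with no
/// in-process control over credentials, progress or cancellation.
fn fetch_via_subprocess(repo: &Path, remote: &str) -> std::io::Result<bool> {
    let output = git_fetch_command(repo, remote).output()?;
    Ok(output.status.success())
}
```

A linkable library would replace the `Command` with an ordinary function call, which is the gap the comment above describes.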
Prove it - put it under GPL, like the original sources you ingested were.
If that was true they'd use the original license. They are not. The whole RiiR movement is very obviously switching away from a pro-user license (GPL).
Agree with first half of this sentence, we should all have fun with experiments.
> It was never based on a linkable and reentrant library, but instead on a "Unix" philosophy of chaining together simpler commands, which means that it's difficult to use it in long running processes without fork/exec overhead for everything.
Ahhh now we have philosophical disagreement in the only place in the entire article that says "why". Unix is a feature, it's arguably more important in current time: https://aperocky.com/blog/post.html?slug=unix-philosophy-age...
> It was never based on a linkable and reentrant library, but instead on a "Unix" philosophy of chaining together simpler commands, which means that it's difficult to use it in long running processes without fork/exec overhead for everything.
git operates on the filesystem level, the unix behavior is just getting buried. You cannot rewrite git into a linkable library and decide it's now not unix. Its entire behavior is unix, which is why it's awesome.
The point is to provide a feature-complete reentrant linkable library. Even if it's an ugly and slow one, this is still the only one thing that exists that covers those points - Gitoxide and libgit2 are both awesome but they are not feature complete.
If that was the goal, why change the license?
Thanks.
Libgit2 is meant to address this and I was heavily involved in the development of that project 15 years ago. It's great but it's not feature complete and its development is also completely separate from git development, so it's out of sync and constantly struggling to keep up.
The pro-user license (GPL).
That said, stability overall has been nothing short of fantastic. And I can't answer the question of "why?" for this particular rewrite.
I’m all for the hundreds of reasonable objections but this sort of trash mindless critique is as useless as what it denounces.
You don't get to choose a license and then add extra terms to it when you don't feel like it's up to scratch. That's something explicitly not allowed by the GPL license.
Isn’t having to stay under the GPL a very big part of the GPL license?
The first part of this sentence (where in the GPL) is unreached if the second part of it is unmet (relicense code or derivatives) which I contend it likely is. You're begging the question.
However:
> The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work
earlier:
> A “covered work” means either the unmodified Program or a work based on the Program.
It's that element that would be difficult to prove "work based on the Program"
"here's a test suite, write code in rust that makes that suite pass" is reasonably supported by the article. That would likely not be a derivative work.
I could have missed them. I didn’t read everything. I did some quick searches.
But the fact they’re not obvious is kind of troubling. And the fact that they didn’t copy only the tests and documentation for the LLM, withholding the source to prevent it from looking, would hurt any case they had for clean-room privileges in my eyes, ignoring my other comment with concerns about using the tests at all.
IMO, IANAL, etc.
And we’ll ignore the question of what the fact the LLM has certainly seen the git code during training means.
But the test suite would have to stay under the original license. And if you use a GPL test suite as the kernel to develop a program from, can you license it non-GPL? I’d question that personally. Same acronyms above apply.
> A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an “aggregate” if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.
So assuming that sum(a, b) is non-infringing and not combined to form a larger program (i.e. the tests aren't compiled into the grit code), then the GPL explicitly doesn't apply to this use
But if you take all the individual tests used to test git as a whole, that seems far more unique. Seems like at that point you’re really having to duplicate the actual git internals, and that seems like it should be covered.
Feel free to extrapolate to the threshold where it's not and at that point apply.
> you’re really having to duplicate the actual git internals
Copyright covers the expression, not the method. So the Rust function:
  fn sum(a: u8, b: u8) -> u8 {
      a + b
  }
is distinct from the C function:
  int sum(int a, int b)
  {
      return a + b;
  }
Similarly, is there any momentum left for Cloudflare's EmDash? I can barely find any discussion after April.
> it has been nearly entirely written by agents and has not been used for realsies. It's probably currently unusably slow or completely broken in ways that are not exercised in the test suite.
Right now it's someone else's experiment that is still in the "might or might not pan out" stage.
There are a bunch of projects using the similar (not vibe coded, less fully featured) gitoxide project - there is demand for git-as-a-library.
The author of gitoxide is also working on GitButler (who worked on this project) and we're pushing both projects forward and actively using and developing Gitoxide as well. This is simply a different and hopefully complementary approach to the same problem.
Sorry, no. Let me be candid and point out that this has achieved exactly nothing except lighting $8k on fire.
Put it this way: if I suggest to my boss, "I want to spend $8k of company money to port git to Rust to just see how many tests can pass in that project, even though I don't plan to develop new features with the project, and I don't care about adoption", he is going to shoot down the idea in half a second and seriously question my competence.
Why not just make better Python bindings to libgit?
It's an organic success, hard to replicate. If at all, CF can only make people migrate with massive effort. Marketing effort, selling lots of snake oil in the process. WP won't just hop on the hot new thing, WP is the definition of the opposite. It works for them. Why change?
Git is the same on the other side. It requires maintenance and improvements, surgical and correct. No git maintainer has time to learn a gigantic new codebase and they will stick with what works for them. For git users there are no advantages. So similarly it would require a long time effort to push the project, building trust that it is somehow better, probably requiring Linus to say "it's great".
I downloaded v0.3.99 for Linux x86_64 and stripped the binary. It ends up at 31 MB. The .text section is 25 MB.
I'm surprised by the large size. On my system /usr/bin/git is 4.7 MB, although git is split up into multiple programs. I'm not comparing apples to apples, but this is weird.
If anyone digs into the binary size, please share what you find.
I haven't dug into this at all yet, nor have I tried to optimize the size (or really, anything else).
However, the library part will be less than half of this - a lot of code is spent on the CLI specific stuff and would not be part of the library, which is mostly what I care about for the purposes of this project. The CLI part is just to try to prove the point that it actually does what Git does. The library part is what might be useful in that nothing else exists that does all of the things that it does (provide a reentrant linkable library that is feature complete with Git).
Splitting it by crate: `grit` is 13.6MiB, `grit_lib` is 4.8MiB and then it's `std`, `rustls` and `regex_automata` that are the next largest. So as pure library you could hopefully shave off quite a bit of that 25MiB.
[1] https://github.com/gitbutlerapp/grit/blob/main/grit-lib/src/...
[1]. https://github.com/ianm199/lua-rs/tree/main Lua
[2]. https://github.com/ianm199/valdr Valkey/ Redis
[3]. https://github.com/ianm199/nginx-rs-port nginx
Happy to answer any questions on the approach! When I started a few weeks ago the harnesses on their own were not good enough to get very far without a "meta harness" of sorts but that is changing largely with Claude Workloads and Mythos. A lot of the work is developing some custom tooling to move these along faster.
But in terms of learning I'm learning relatively little about how to type Rust into an editor but a lot about how to set up agentic loops that can autonomously get tests to pass and improve performance.
For example if you just tell a frontier model (gpt5.5 or Claude Code 4.8) to make some portion of the tests pass they will take forever and just bang their heads against it. I developed a framework to mimic a lot of these tests in nginx... but in minimum non blocking ways so you can run many in parallel with short feedback loops.
Similar for performance - how to make tons of performance benchmarks and expose maximum telemetry for agents to go and analyze the hot paths etc.
That is true, however did you actually do any research into nginx? Is it particularly prone to memory bugs?
I honestly don't know the answer but you seem to be coming from a place of C bad, therefore nginx super vulnerable?
In my experience with other web servers the vast majority of security bugs are string handling related (path/header injection), which your rewrite will not protect you from.
The project was inspired by that. Also, unlike most other projects, nginx is often directly exposed to the internet, which makes it more vulnerable than, e.g., Redis/Valkey or something that would generally be running within a company's network.
"C Bad" is a bit reductionist... but I think there is some truth to the take "Until you have the evidence, don’t bother with hypothetical notions that someone can write 10 million lines of C without ubiquitous memory-unsafety vulnerabilities – it’s just Flat Earth Theory for software engineers" [1]
NSA and other government orgs are also pushing people to stop using C [2] for important software.
[1]. https://alexgaynor.net/2020/may/27/science-on-memory-unsafet... [2]. https://linuxsecurity.com/news/government/nsa-s-plea-stop-us...
No one really knows what the endgame of software security looks like.
So some people should try the port to rust angle, some should focus on hardening the C, some should explore more exotic options like formally provable languages etc
1) There may be situations where a fork makes sense (e.g. because one project cannot serve different use cases well). 2) Which is why usually a "higher goal" is used to justify this, e.g. authors pretend (or lie to themselves, or may be stupid enough to actually believe this) that some improvement in memory safety is really that important.
You've been caricatured into a blind AI-follower rust-rewriter-just-because type, and that's the surface they'll continually attack (you're wasting time, hurting the community, v2-itis, bikeshedding, premature optimization, copyright violation, moustache-twirling-evil-intent-rug-pull-later, etc etc etc).
Just continue in your work. It's good, and we need people like you.
Please excuse me for being unnecessarily harsh for a moment, but web servers are a dime a dozen. Quite literally. The reason nginx is successful is because it is maintained. Unless you plan on maintaining your nginx-clone as well as nginx itself, it will not be useful.
Perhaps you do, in which case I am more than happy to be wrong, but sometimes people think the act of writing software itself is useful and that other people will happily swarm over it and maintain it in their absence, but that is usually not the case.
The world has tens of thousands of http daemons, increasing that number by one is not useful in itself. The act of maintaining software over time and keeping it useful for many people however, absolutely can be.
Basically I use these "kits" to prove that the behavior is working as expected with mocked data/ interfaces and then only after these kits pass I'll run the real test suite files as confirmation. So these let you iterate a lot faster than the official test suite because it is very slow.
These are bootstrapped from the real tests.
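As a rough illustration of the "kit" idea described above (fast checks against mocked interfaces first, the slow real suite only afterwards), here is a minimal sketch. The trait `Upstream`, the `MockUpstream` type and the functions `respond` and `run_kit` are all hypothetical names invented for this example; they are not taken from the nginx port or its tooling.

```rust
/// Hypothetical interface the code under test depends on.
trait Upstream {
    fn fetch(&self, path: &str) -> Option<String>;
}

/// In-memory stand-in, so a kit never touches the network or disk.
struct MockUpstream {
    routes: Vec<(&'static str, &'static str)>,
}

impl Upstream for MockUpstream {
    fn fetch(&self, path: &str) -> Option<String> {
        self.routes
            .iter()
            .find(|(route, _)| *route == path)
            .map(|(_, body)| body.to_string())
    }
}

/// Behavior being verified: turn a request path into a response line.
fn respond(upstream: &dyn Upstream, path: &str) -> String {
    match upstream.fetch(path) {
        Some(body) => format!("200 {body}"),
        None => "404".to_string(),
    }
}

/// Run every (path, expected response) case and return the failing paths.
/// An empty result means the kit passed and the real suite is worth running.
fn run_kit(upstream: &dyn Upstream, cases: &[(&str, &str)]) -> Vec<String> {
    cases
        .iter()
        .filter(|(path, expected)| respond(upstream, path) != *expected)
        .map(|(path, _)| path.to_string())
        .collect()
}
```

Because each kit is pure and in-memory, many of them can run in parallel with short feedback loops, which is the property the comment is after.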
The other commenter was being a bit dismissive but this is the kind of thing I'm taking away as a real useful pattern to do verification of behavior at scale.
I have no idea why you are making me spell this out, I thought it was pretty obvious.
I want to get it to the point where we can replace fork/exec'ing to an unknown Git binary or having said binary be an external dependency for GitButler. The networking stuff (push/fetch) is currently an external dep for both GitButler and Jujutsu (and pretty much every other Git-based tool in the world). I'm pretty sure I can get the project good enough at these networking ops (including all the hairy credential stuff) to be able to not need those fork/exec calls.
This is morally, if not legally, wrong.
Rustwashing?
The pattern I see here is people vibing slop rewrites of GPL projects to get them under more favorable licenses. Rust just happens to be the language this one picked (for various reasons that are not very relevant here.)
Goal is to be able to transfer context from one agent to another when switching which provider is being used. So when I hit usage limits on claude I can run handoff claude codex and codex is given a md file to start from and continue working.
Still early but I've found it useful in my daily flows already!
We clearly learned from how Git does operations and emulated it in order to function interoperably, the same way that Gitoxide and libgit2 have, and released it under a license that would be the most valuable for people wanting to use a linkable library, the same way that Gitoxide and libgit2 have.
Not impossible. It forces the code using the library to be under a GPL-compatible license and requires the binary to be released under the GPL license.
The distinction is quite important. It's only impossible in the mind of someone who wants to release proprietary software. Even for people releasing software under permissive license it's not impossible, just highly inconvenient (and the LGPL is always an option in this case).
What a weaselly way to put it.
A GPL library, as I'm sure you know, is perfectly usable by anyone including jujutsu and anyone else. They just have to also license under the GPL and this is no barrier to open source projects.
Ok, my coffee just kicked in and I'm incensed. Might go do an FSF donation.
So you didn't just let an AI go nuts with access to the source code of Git so as to produce a derivative work?
And what issues did GPL impose on the community all these 21 years of git existence?
Also, I worked on the Ruby Grit pretty extensively during the early days of GitHub, so hopefully I earned the right to carry on the mantle. :)
I pray everything switches to usage based billing and the curtains can close on this era.
> Currently both Gitoxide and libgit2's networking functionality is either partial, slow or non-existant. Both GitButler and Jujutsu rely on forking out to Git in order to push or pull data. A big reason for this is the incredibly complicated credential logic involved, but all of this is (theoretically) currently covered in Grit.
You decide whether you have followed it or not. The other party will decide if they agree. If in dispute, you go to a judge and they decide also.
it's just in this case it's the author. we'll have to wait and see who decides to challenge it
> You had me at WASM...
What does this mean? Does the OP want us to convert his AI-vibed code to a WASM-compatible build? Does the OP even understand what that entails?
In fact, I would rather it stay C for 15 more years.
Don't bother.
It's probably not for you. It's slower, more obtuse, more bloated, less capable, exponentially less scalable at any size. Canonical Git is better in every way, except being a linkable library.
Even in the arena of being linkable libraries that can do Git stuff, both Gitoxide (Rust) and libgit2 (C which has git2 crate Rust bindings) are both better, they're just not feature complete. That is the only point of this project.
The git test suite is a behavioral spec. But at the same time it is not, why? It's an archaeological site... I'd say. You can dig, and dig, and dig, and find some truth, but also a lot of historical decisions that don't matter today.
Here's the flow I suggest:
- first, reverse-engineer the behavioral intent from tests/docs/code
- build a taxonomy of what git promises
- group that into small "conformance slices"
- hand those slices to agents/humans/whatever
- start writing Rust... Or Visual Basic... At this point it doesn't matter almost
Without this critical layer, agents are optimizing for "make this test green" instead of "preserve this semantic contract". And this is exactly where the funny stuff happens... shelling to real git, hardcoding expectations, implementing sha256 metadata but not the behavior, etc.
Why are we doing stupid things and winning stupid prizes? I have to admit they are impressive, but I STRONGLY believe if we did this in two passes you would have not a $15k check but maybe something closer to $1-5k.
Who should I talk to, to make this happen? Making the first pass is almost deterministic. LLMs help. The only problem is making sure that YOU understand the spec, and this will be a bottleneck for a while (i.e. can't outsource understanding).
I can guide what exactly needs to happen. I already validated this idea on my own project and it worked: 10k LoC -> 250 acceptance criteria. And you can find it on my GitHub, I even described steps. If you're too lazy here it is:
You need a SCIP graph. An agent goes over it and builds a ledger, one symbol at a time, looking at what kind of problem each symbol solves. This leads you to v0 specs, then you can figure out a taxonomy, then you try to fit all those thousands of specs into taxonomy groups that make sense. That would be the v1 spec, but you might want to refine it to v2/v3. Then the only thing left would be to figure out what kind of tests those specs should have (e2e/unit/integration/api/whatever). This is the tricky part but doable. I'm thinking for git you wanna do e2e specs. Yes, that's a lot of e2e but the purpose is that we build the same expectations for git, and then we replace git with grit and the spec should still stay green, right?
Hope that makes sense.
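The ledger-to-slices step described above could be sketched roughly as follows. This is only a guess at the shape of the data: `LedgerEntry`, its fields and `build_slices` are hypothetical names, and the example rows in the note below are made up for illustration.

```rust
use std::collections::BTreeMap;

/// One ledger row: a symbol, the taxonomy area it was assigned to,
/// and the behavior it is responsible for. All names are hypothetical.
#[derive(Debug, Clone)]
struct LedgerEntry {
    symbol: String,
    area: String,
    criterion: String,
}

/// Group acceptance criteria by taxonomy area to form conformance slices.
/// A BTreeMap keeps the slices in a stable, sorted order between runs.
fn build_slices(ledger: &[LedgerEntry]) -> BTreeMap<String, Vec<String>> {
    let mut slices: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for entry in ledger {
        slices
            .entry(entry.area.clone())
            .or_default()
            .push(format!("{}: {}", entry.symbol, entry.criterion));
    }
    slices
}
```

With three rows such as `cmd_add` and `cmd_rm` under an "index" area and `cmd_fetch` under "transport", this yields two slices, each small enough to hand to one agent or person along with the e2e tests that define "done" for it.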
Reimplementation is a particularly juicy target because it's easy to test. Imagine someone writing a better browser than Chrome from scratch in just a year.
Because of this moats around business due to difficulty of implementation are effectively gone.
Especially if there's the same thing that already exists in open source that the model can plagiarize for you.
Why not 100%?
> It's not actually passing every single test, though that is on purpose. I did mark some parts of the testing suite as "skipped" because I don't think it's worth recreating them in a library like this
> 41,715 / 42,001 tests passing (99.3%)
So it is not the entire suite, then, but somehow that was worth burning ~$8,000 worth of tokens?
From the article
> It's not actually passing every single test, though that is on purpose. I did mark some parts of the testing suite as "skipped" because I don't think it's worth recreating them in a library like this - email related stuff, i18n, perforce/svn importers, some of the midx/bitmap stuff - things of that nature. However, for everything that I'm sure is relevant to nearly anyone reading this, the Grit library/CLI can now fully pass the Git test suite.
> Having parts of Git as discrete, embeddable slices of library also enables things like building custom Git servers or client functionality in Rust.
> The full build of all Git functionality in Rust is currently around 27M, but since a large part of it is a library, it could clearly be easily split up into domains of functionality - subcrates that do specific things. Perhaps you could simply use the subset you need.
> it made me wonder about the feasibility of using that same approach to accomplish something I've been dreaming about for 15 years now,
> which means that it's difficult to use it in long running processes without fork/exec overhead for everything.
> What if we used the same basic idea that Anthropic used on their from-scratch C compiler? Start a brand new implementation, design it as a Rust library, then throw a swarm of agents at the problem
I don't care if any git I use has email features. IIUC, even most of the people that use git with email don't directly use the email features, they use the patch set features like `git am`. I expect `git am` to work, I don't expect git to actually do email.