> Want to keep LLM .md files in a separate overlay, only make them visible on request? Also easy. CRDT gives the freedom in splitting and joining along all the axes.
I now have a bunch of layers of text / markdown: system prompts, AGENTS.md, SKILL.md, plus user tweaks or full out replacements to these on every repo or subproject.
Then we want to do things like update the "root" system prompt and have that applied everywhere.
There are analogies in git, CMS templating systems, software package interfaces and versioning. Doing it all with plain text doesn't feel right to me.
Any other approaches to this problem? Or is Beagle and ASTs and CDRTs really onto something here?
are these asts fully normalized or do (x) and ((x)) produce different trees, yet still express the same thing?
why change what is being stored and tracked when the language aware metadata for each change can be generated after the fact (or alongside the changes)? (adding transform layers between what appears and what gets stored/tracked seems like it could get confusing?)
For one, it eliminates a class of merge conflict that arises strictly from text formatting.
I always liked the idea of storing code in abstraction, especially editors supported edit-time formatting. I enjoy working on other people's code, but I don't think anybody likes the tedium of complying with style guides, especially ones that are enforced at the SCM level, which adds friction to creating even local, temporary revisions. This kind of thing would obviate that. That's why I also appreciate strict and deterministic systems like rustfmt. Unison goes a little further, which is neat but I think they're struggling getting adoption because of that, even though I'm pretty sure they've got some better tooling for working outside the whole ecosystem. These decoupled tools are probably a good way to go.
I was messing around with a file-less paradigm that would present a source tree in arbitrary ways, like just showing a individual functions, so you have the things you're working on co-located rather than switching between files. Kind of like the old VB IDE.
I think it sees very little usage though.
But, yeah, I think a personalized development environment with all of your preferences preserved and that don't interfere with whatever the upstream standard is would be a nice upgrade.
There are some clever things that can be done with merge/split using CRDTs as the stored transformation, but they're hard to reason about compared to just semantic merge tools, and don't outweigh the cognitive overhead IMO.
Having worked for many years with programming systems which were natively expressed as trees - often just operation trees and object graphs, discarding the notion of syntax completely, this layer is incredibly difficult for humans to reason about, especially when it comes to diffs, and usually at the end you end up having to build a system which can produce and act upon text-based diffs anyway.
I think there's some notion of these kinds of revision management tools being useful for an LLM, but again, at that point you might as well run them aside (just perform the source -> AST transformation at each commit) rather than use them as the core storage.
Easier but much less valuable.
The text based diff method works fine if everyone is working off a head, but when you're trying to compose a release from a lot of branches it's usually a huge mess. Text based diffs also make maintaining forks harder.
Git is going to become a big bottleneck as agents get better.
first you should not be composing releases at the end from conflicting branches, you should be integrating branches and testing each one in sequence and then cutting releases. if there are changes to the base for a given branch, that means that branch has to be updated and re-tested. simple as that. storing changes as normalized trees rather than normalized text doesn't really buy you anything except for maybe slightly smarter automatic merge conflict resolution but even then it needs to be analyzed and tested.
The downside is tight coupling between VCS and editor. It will be difficult to convince developers to use anything else than their favourite editor when they want to use your VCS.
I wonder if you can solve it the language-server way, so that each editor that supports refactoring through language-server would support the VCS.
And you try and tell the young people of today about RAS syndrome, they won't believe you!
If it were intended to be general replacement for general purpose version control systems, I am not sure how storing AST is better than storing the original plain text files since the transformation from text to AST might be lossy. I might want to store files with no AST (e.g. plain text files), files with multiple AST (e.g. polyglots), multiple files with the same AST (e.g. files to test different code layout), broken AST (e.g. data files to be used as test cases). These use cases would be trivially supported by storing the original file as is, whereas storing any processed form of the file would require extra work.
The way it makes an AST tree is non-lossy. Additionally, it stamps ids on the nodes, so merges do not get confused by renames, formatting changes and similar things. There is value in preserving structure this way that repeat parsing can not provide. In big-O terms, working with such an AST tree and a stack of its patches is not much different from stacks of binary diffs git is using.
If I have k independent changesets, I have k^2 unplanned interactions and 2^k unplanned change combinations. Having a bunch of change sets, which I had not fully evaluated yet, esp in relation to one another, I would like k-way merges and repeat-merges to be seamless, non-intrusive and deterministic. git's merges are not.
The project is experimental at this point.
If you collect test cases for compilers, for example.
> tree-sitter sort of handles that
My worry is that stability of committed ASTs would depend on tree-sitter being stable, and it might be difficult to guarantee that for languages are still in flux. Even most well established languages gain new grammar once every few years, sometimes in backward incompatible ways.
Maybe you meant tree-sitter itself will also be versioned inside this repository?
Also, there is an option to pick a codec for a particular file. Might use tree-sitter-C, might use general-text. The only issue here, you can't change the codec and keep nice diffs.
So, these cases are handled.
I think this project is incredibly cool as a line of research / thought but my general experience in trying to provide human interfaces using abstractions over source code suggests that most people in general and programmers especially are better at reasoning in the source code space. Of course, beagle can generate into the source code space at each user interaction point, but at that point, why not do the opposite thing, which is what we already do with language servers and AST driven (semantic) merge and diff tools?
Would you say these are commonly in use, and if so what are some "mainstream" examples? IME most people just use git's built-in diff/merge...
From "Large Language Models for Mathematicians (2023)" (2025) https://news.ycombinator.com/item?id=42899805 :
> It makes sense for LLMs to work with testable code for symbolic mathematics; CAS Computer Algebra System code instead of LaTeX which only roughly corresponds.
> Are LLMs training on the AST parses of the symbolic expressions, or token coocurrence? What about training on the relations between code and tests?
There are already token occurrence relations between test functions and the functions under test that they call. What additional information would it be useful to parse and extract and graph rewrite onto source code before training, looking up embeddings, and agent reasoning?