Beagle, a source code management system that stores AST trees
93 points | 21 hours ago | 11 comments | github.com
nzoschke
16 hours ago
[-]
In https://replicated.wiki/blog/partII this part is very interesting to me:

> Want to keep LLM .md files in a separate overlay, only make them visible on request? Also easy. CRDT gives the freedom in splitting and joining along all the axes.

I now have a bunch of layers of text / markdown: system prompts, AGENTS.md, SKILL.md, plus user tweaks or outright replacements of these in every repo or subproject.

Then we want to do things like update the "root" system prompt and have that applied everywhere.

There are analogies in git, CMS templating systems, software package interfaces and versioning. Doing it all with plain text doesn't feel right to me.

Any other approaches to this problem? Or are Beagle and ASTs and CRDTs really onto something here?

reply
a-dub
17 hours ago
[-]
mmm. interesting and fun concept, but it seems to me like the text is actually the right layer for storing and expressing changes since that is what gets read, changed and reasoned about. why does it make more sense to use asts here?

are these asts fully normalized or do (x) and ((x)) produce different trees, yet still express the same thing?

why change what is being stored and tracked when the language aware metadata for each change can be generated after the fact (or alongside the changes)? (adding transform layers between what appears and what gets stored/tracked seems like it could get confusing?)
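On the normalization question, it depends entirely on the parser. As one data point (Python's stdlib `ast`, not Beagle's tree-sitter-based trees), redundant grouping parentheses do normalize away:

```python
import ast

# Python's stdlib parser fully normalizes grouping parentheses:
# "(x)" and "((x))" produce identical trees.
t1 = ast.dump(ast.parse("(x)"))
t2 = ast.dump(ast.parse("((x))"))
print(t1 == t2)
```

Whether a tree-sitter grammar does the same is grammar-specific; concrete syntax trees often keep the extra nodes.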

reply
ibejoeb
17 hours ago
[-]
> why does it make more sense to use asts here

For one, it eliminates a class of merge conflict that arises strictly from text formatting.

I always liked the idea of storing code in abstraction, especially if editors supported edit-time formatting. I enjoy working on other people's code, but I don't think anybody likes the tedium of complying with style guides, especially ones that are enforced at the SCM level, which adds friction to creating even local, temporary revisions. This kind of thing would obviate that. That's why I also appreciate strict and deterministic formatters like rustfmt. Unison goes a little further, which is neat, but I think they're struggling to get adoption because of that, even though I'm pretty sure they've got some better tooling for working outside the whole ecosystem. These decoupled tools are probably a good way to go.

I was messing around with a file-less paradigm that would present a source tree in arbitrary ways, like just showing individual functions, so you have the things you're working on co-located rather than switching between files. Kind of like the old VB IDE.
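The core of that file-less view is easy to sketch once you have a tree. A minimal Python example (using the stdlib `ast`, not whatever that prototype used): index function definitions by name and render any one of them on demand, independent of file layout.

```python
import ast
import textwrap

# Two functions that could live anywhere in a source tree.
src = textwrap.dedent("""
    def area(r):
        return 3.14159 * r * r

    def total(items):
        return sum(items)
""")

# Index top-level functions by name; render each from its subtree,
# independent of where it sits in any file.
tree = ast.parse(src)
funcs = {node.name: ast.unparse(node)
         for node in tree.body if isinstance(node, ast.FunctionDef)}
print(funcs["total"])
```

A real editor would index a whole tree of files the same way and show you only the definitions you asked for.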

reply
micw
17 hours ago
[-]
An AST-based conflict resolver could eliminate the same kind of merge conflicts on a text-based RCS.
reply
ibejoeb
17 hours ago
[-]
Yeah, I suppose that's true, too. You've got to do the conversion at some point. I don't know that you get any benefit from storing the text, doing the transformation to support whatever ops (deconflicting, etc.), and then transforming back to text, vs. just storing it in the intermediate format. Ideally, this would all be transparent to the user anyway.
reply
gritzko
16 hours ago
[-]
For one merge, yes. The fun starts when you have a sequence of merges. CRDTs put ids on tokens, so things are a bit more deterministic. Imagine a variable rename or a whitespace change; it messes up text diffing completely.
reply
hosh
10 hours ago
[-]
I remember someone mentioning a system that operated with ASTs like this in the 70s or 80s. One of the affordances is that the source base did not require a linter. Everyone reading the code can have it formatted the way they like, and it would all still work with other people’s code.
reply
skybrian
10 hours ago
[-]
It seems like that could be done in the editor if you auto-reformat on load and save. (Assuming there's an agreed-on canonical format.)
reply
psadri
17 hours ago
[-]
Related, I’d love an editor that’d let me view/edit identifier names in snake_case and save them as camelCase on disk. If anyone knows of such a thing - please let me know!
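The display half of that is just a reversible string mapping. A hypothetical sketch (not any existing editor's implementation) of the two directions:

```python
import re

# Hypothetical core of such an editor: a reversible mapping between
# the on-disk spelling and the displayed spelling of an identifier.
def snake_to_camel(name: str) -> str:
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)

def camel_to_snake(name: str) -> str:
    return re.sub(r"(?<=[a-z0-9])([A-Z])",
                  lambda m: "_" + m.group(1).lower(), name)

print(snake_to_camel("load_user_data"))  # displayed form
print(camel_to_snake("loadUserData"))    # saved-to-disk form
```

The catch is that the mapping isn't a bijection once acronyms appear (`parseHTTPResponse` round-trips badly), which may be why so few editors attempt it.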
reply
Wilfred
14 hours ago
[-]
This is actually possible with glasses-mode in Emacs: https://codelearn.me/2025/02/24/emacs-glasses-mode.html

I think it sees very little usage though.

reply
ibejoeb
17 hours ago
[-]
Sure. Presumably you could have localized source presentation, too.

But, yeah, I think a personalized development environment, with all of your preferences preserved and none of them interfering with whatever the upstream standard is, would be a nice upgrade.

reply
bri3d
17 hours ago
[-]
100% agree. I think AST-driven tooling is very valuable (most big companies have internal tools akin to each operation Beagle provides, and Linux has Coccinelle / spatch, for example), but it's still more easily implemented as a layer on top of source code than as the fundamental source of truth.

There are some clever things that can be done with merge/split using CRDTs as the stored representation, but they're hard to reason about compared to just semantic merge tools, and don't outweigh the cognitive overhead IMO.

Having worked for many years with programming systems that were natively expressed as trees - often just operation trees and object graphs, discarding the notion of syntax completely - I've found this layer incredibly difficult for humans to reason about, especially when it comes to diffs, and usually at the end you end up having to build a system that can produce and act upon text-based diffs anyway.

I think there's some notion of these kinds of revision management tools being useful for an LLM, but again, at that point you might as well run them alongside (just perform the source -> AST transformation at each commit) rather than use them as the core storage.

reply
ragall
16 hours ago
[-]
> but it's still just easier implemented as a layer on top of source code than the fundamental source of truth

Easier but much less valuable.

reply
a-dub
16 hours ago
[-]
you can parse the text at any time pretty much for free and use anything you learn to be smarter about manipulating the text. you can literally replace the default diff program with one that parses the source files to do a better job today.
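This is doable today with stock git. Difftastic (`difft`) is one such parsing diff tool; assuming its binary is on your PATH, wiring it in is a one-liner:

```shell
# Use a tree-sitter-based diff for one invocation...
GIT_EXTERNAL_DIFF=difft git diff

# ...or make it the default diff program.
git config --global diff.external difft
```

The storage stays plain text; only the presentation layer becomes syntax-aware.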
reply
derriz
13 hours ago
[-]
This is the fundamental idea behind git - to fully compute/derive diffs from snapshots (commits) and to only store snapshots. While brilliant in some ways - particularly the simplifications it allows in terms of implementation - I've always felt that dropping all information about how a new commit was derived from its parent(s) was wasteful. There have been a number of occasions where I wished that git recorded a rename/mv somehow. It's particularly annoying when you squash some commits and suddenly git no longer recognizes that a file was renamed where previously it was able to determine this. Now your history is broken - "git blame" fails to provide useful information, etc. There are other ways of storing history and revisions which don't have this issue - git isn't the end of the line in terms of version control evolution.
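You can watch the re-derivation happen in a throwaway repo (a sketch assuming git is installed; nothing in the object store records the rename, `--follow` re-detects it from content similarity every time):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev

printf 'x = 1\n' > util.py
git add util.py
git commit -qm 'add util.py'

git mv util.py helpers.py
git commit -qm 'rename to helpers.py'

# Without --follow, history for helpers.py starts at the rename commit;
# with it, git heuristically reconnects the pre-rename history.
git log --follow --oneline -- helpers.py
```

Squashing can push the old and new names into one diff where the similarity heuristic no longer fires, which is exactly the breakage described above.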
reply
gritzko
15 hours ago
[-]
CRDT's trick is metadata. Good old diff guesses the changes by solving the longest-common-subsequence problem. There is always some degree of confusion as changes accumulate. CRDTs can know the exact changes, or at least guess less.
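The LCS guessing is visible in the stdlib. A minimal illustration with `difflib` (Python's LCS-style matcher): the diff has to infer which tokens correspond, whereas a CRDT would already know, because each token carries a stable id from the moment it was typed.

```python
import difflib

old = ["def", "total", "(", "items", ")", ":"]
new = ["def", "grand_total", "(", "items", ")", ":"]

# diff-style tools guess the correspondence by matching common subsequences.
sm = difflib.SequenceMatcher(None, old, new)
ops = [op for op in sm.get_opcodes() if op[0] != "equal"]
print(ops)

# A CRDT instead stamps each token with an id at creation time, so a
# rename is recorded as "that specific token changed" - no inference.
```

Here the guess happens to be right; pile up a rename plus a reflow plus a move and the inferred correspondence degrades, while the ids do not.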
reply
CuriouslyC
17 hours ago
[-]
One nice thing about serializing/transmitting AST changes is that it makes it much easier to compose and transform change sets.

The text-based diff method works fine if everyone is working off a head, but when you're trying to compose a release from a lot of branches it's usually a huge mess. Text-based diffs also make maintaining forks harder.

Git is going to become a big bottleneck as agents get better.

reply
a-dub
16 hours ago
[-]
what do you actually gain over enforced formatting?

first you should not be composing releases at the end from conflicting branches, you should be integrating branches and testing each one in sequence and then cutting releases. if there are changes to the base for a given branch, that means that branch has to be updated and re-tested. simple as that. storing changes as normalized trees rather than normalized text doesn't really buy you anything except for maybe slightly smarter automatic merge conflict resolution but even then it needs to be analyzed and tested.

reply
CuriouslyC
16 hours ago
[-]
Diffs are fragile, and while I agree with that process in a world where humans do all the work and you aren't cutting a dozen different releases, I think that's a world we're rapidly moving away from.
reply
a-dub
12 hours ago
[-]
in that case you probably flag a bunch of prs for release and it linearizes their order and rebases and tests each one a step ahead of your review (responding to any changes you make as you go).
reply
sse
13 hours ago
[-]
Having a VCS that stores changes as refactorings combined with an editor that reports the refactorings directly to the VCS, without plain text files as intermediate format, would avoid losing information on the way.

The downside is tight coupling between VCS and editor. It will be difficult to convince developers to use anything other than their favourite editor when they want to use your VCS.

I wonder if you can solve it the language-server way, so that each editor that supports refactoring through language-server would support the VCS.
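A hypothetical sketch of what such a record might look like (all names invented here, loosely modeled on how LSP structures a rename request): the editor reports the operation itself, not the resulting text.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical: a refactoring as a first-class VCS operation rather
# than a plain-text diff, keyed by symbol identity, not line/column.
@dataclass
class RenameOp:
    symbol_id: str   # stable identity of the symbol being renamed
    old_name: str
    new_name: str

op = RenameOp(symbol_id="fn:total#42", old_name="total",
              new_name="grand_total")
record = json.dumps(asdict(op))  # what the editor would send to the VCS
print(record)
```

Routing it through a language-server-style protocol would get you the editor-agnosticism: any editor that can issue a rename could feed the VCS without knowing its internals.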

reply
majkinetor
16 hours ago
[-]
Somewhat similar project is unison:

https://www.unison-lang.org/docs/the-big-idea

reply
ValentineC
15 hours ago
[-]
Mildly pedantic, but AST already stands for Abstract Syntax Tree, so the post title when unabbreviated is Abstract Syntax Tree trees.
reply
MadxX79
18 hours ago
[-]
Can it store my PIN numbers and my map of ATM machines also?
reply
wolfi1
18 hours ago
[-]
was about to point that out, you beat me to it
reply
MadxX79
18 hours ago
[-]
I rushed so much that I didn't have time to do it right. It could have been the AST tree of my PIN number validation algorithm for ATM machines. :-P
reply
danparsonson
10 hours ago
[-]
I don't think your original post suffered for the lack of one more TLA acronym.
reply
pseudohadamard
5 hours ago
[-]
Right! I had to get up in the morning, at ten o'clock at night, half an hour before I went to bed... sorry, wrong sketch. I had to set up my PIN number to display on the LCD display of an ATM machine with the instructions printed in PDF format telling me how to add VAT tax all before midday GMT time.

And you try and tell the young people of today about RAS syndrome, they won't believe you!

reply
omoikane
16 hours ago
[-]
The linked page looks like a subsystem of some specific library; I am not sure if it is intended for general use.

If it were intended to be a general replacement for general-purpose version control systems, I am not sure how storing the AST is better than storing the original plain text files, since the transformation from text to AST might be lossy. I might want to store files with no AST (e.g. plain text files), files with multiple ASTs (e.g. polyglots), multiple files with the same AST (e.g. files to test different code layouts), or files with a broken AST (e.g. data files to be used as test cases). These use cases would be trivially supported by storing the original file as is, whereas storing any processed form of the file would require extra work.

reply
gritzko
15 hours ago
[-]
(Author) There is a fall-back general-text codec: tokens, no AST (e.g. for Markdown). If that fails (non-UTF-8), there is the general-blob final-fallback codec (the git mode).

The way it builds the AST is non-lossy. Additionally, it stamps ids on the nodes, so merges do not get confused by renames, formatting changes and similar things. There is value in preserving structure this way that repeat parsing cannot provide. In big-O terms, working with such an AST and a stack of its patches is not much different from the stacks of binary diffs git is using.

If I have k independent changesets, I have k^2 unplanned interactions and 2^k unplanned change combinations. With a bunch of changesets that I have not fully evaluated yet, especially in relation to one another, I would like k-way merges and repeat merges to be seamless, non-intrusive and deterministic. git's merges are not.
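The counting above, made concrete (pairwise interactions grow on the order of k^2, more precisely k(k-1)/2, and the merge combinations as 2^k):

```python
from itertools import combinations

k = 5
changesets = [f"cs{i}" for i in range(k)]

# Unplanned pairwise interactions between independent changesets.
pairwise = list(combinations(changesets, 2))
print(len(pairwise))        # k * (k - 1) / 2 = 10 for k = 5

# Possible subsets of changesets one might want to merge together.
combos = 2 ** k
print(combos)               # 32 for k = 5
```

Even at k = 10 that is 45 pairwise interactions and 1024 combinations, which is why the author wants repeat merges to be cheap and deterministic rather than hand-resolved.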

The project is experimental at this point.

reply
jhancock
10 hours ago
[-]
AST of what? Will it read my Clojure code's forms as such? What if my source file has a paren-balancing error? I feel I'm thinking of this at the wrong level/angle.
reply
gritzko
8 hours ago
[-]
I cannot remember a case, in the last 10 years at least, when I committed code that didn't compile. Why should I share that? Also, tree-sitter sort of handles that.
reply
omoikane
7 hours ago
[-]
> code that does not compile. Why should I share that?

If you collect test cases for compilers, for example.

> tree-sitter sort of handles that

My worry is that the stability of committed ASTs would depend on tree-sitter being stable, and it might be difficult to guarantee that for languages that are still in flux. Even most well-established languages gain new grammar once every few years, sometimes in backward-incompatible ways.

Maybe you meant tree-sitter itself will also be versioned inside this repository?

reply
gritzko
3 hours ago
[-]
Tree-sitter can parse somewhat-bad code.

Also, there is an option to pick a codec for a particular file. Might use tree-sitter-C, might use general-text. The only issue here is that you can't change the codec and keep nice diffs.

So, these cases are handled.

reply
sethev
16 hours ago
[-]
It leans on tree-sitter for language handling, so I wonder if they're actually Concrete Syntax Trees.
reply
xedrac
18 hours ago
[-]
This sounds good in theory, but it means Beagle needs to understand how to parse every language, and keep up with how they evolve. This sounds like a ton of work and a regression could be a disaster. It'll be interesting to see how this progresses though.
reply
bri3d
17 hours ago
[-]
IMO this really isn’t a huge problem for this project specifically, since that part is outsourced to tree-sitter, which has a lot of effort behind it to begin with.

I think this project is incredibly cool as a line of research / thought, but my general experience in trying to provide human interfaces using abstractions over source code suggests that most people in general, and programmers especially, are better at reasoning in the source code space. Of course, Beagle can generate into the source code space at each user interaction point, but at that point, why not do the opposite thing, which is what we already do with language servers and AST-driven (semantic) merge and diff tools?

reply
ibejoeb
16 hours ago
[-]
It's also just one more facet. The problem already exists for anything else that we already have, like formatters, linters, syntax highlighters, language servers... And it's also not an exclusive choice. If you want to use a dumb editor, there's nothing preventing that. All of the machinery to go back and forth to text exists. Not really a huge departure.
reply
computably
13 hours ago
[-]
> AST driven (semantic) merge and diff tools?

Would you say these are commonly in use, and if so what are some "mainstream" examples? IME most people just use git's built-in diff/merge...

reply
mtndew4brkfst
6 hours ago
[-]
I find Mergiraf pretty pleasant to use and frequently pretty helpful as a time-saver. Handles TOML and Rust for me, and I have way fewer manual interventions, especially after supplementing it with rustfmt rules to not do a bunch of merged use statements in one go. Easy to configure as a jujutsu tool as well.

https://mergiraf.org/

reply
ktpsns
18 hours ago
[-]
Glad to see this. We can do better than git.
reply
_ZeD_
16 hours ago
[-]
who is "we"? and "better" in what measure?
reply
thunderbong
18 hours ago
[-]
Care to elaborate?
reply
Maxious
18 hours ago
[-]
reply
BlueHotDog2
13 hours ago
[-]
What bothers me is, while CRDTs converge, the question is to what. In this case, it seems like there's a last-write-wins semantic, which is very problematic as an implicit assumption for code (or anything where this isn't the explicit invariant).
reply
westurner
17 hours ago
[-]
It makes a lot of sense for math-focused LLMs to work with higher order symbols - or context-dependent chunking - than tokens. The same is probably true for software.

From "Large Language Models for Mathematicians (2023)" (2025) https://news.ycombinator.com/item?id=42899805 :

> It makes sense for LLMs to work with testable code for symbolic mathematics; CAS Computer Algebra System code instead of LaTeX which only roughly corresponds.

> Are LLMs training on the AST parses of the symbolic expressions, or token co-occurrence? What about training on the relations between code and tests?

There are already token occurrence relations between test functions and the functions under test that they call. What additional information would it be useful to parse and extract and graph rewrite onto source code before training, looking up embeddings, and agent reasoning?

reply