I also wasn't familiar with this terminology:
> You hand it a function; it tries to match it, and you move on.
In decompilation "matching" means you found a function block in the machine code, wrote some C, then confirmed that the C produces the exact same binary machine code once it is compiled.
The author's previous post explains this all in a bunch more detail: https://blog.chrislewis.au/using-coding-agents-to-decompile-...
Eventually some or many of these attempts would, of course, fail, and require programmer intervention, but I suspect we might be surprised how far it could go.
What LLMs are (still?) not good at is one-shot reverse engineering for understanding by a non-expert. If that's your goal, don't blindly use an LLM. People already know that blindly having an LLM write prose or code is a bad idea, but it's worth remembering that decompilation is even harder :)
I asked Opus how hard it would be to port the script extender for Baldur's Gate 3 from Windows to the native Linux build. It outlined that it would be very difficult for someone without reverse engineering experience, and correctly pointed out that they use different compilers, so it's not a simple mapping exercise. Its recommendation was not to try unless I was a Ghidra master and had lots of time on my hands.
I've had CC build semi-complex Tauri, PyQt6, Rust and SvelteKit apps for me without ever having touched those languages. Is the code quality good? Probably not. But all those apps were local-only tools or had fewer than 10 users, so it doesn't matter.
For this project, it explained its reasoning well, and, knowing my own skillset and having only surface-level knowledge of how one would even start, I agreed with its many good points that made the project unrealistic for me.
Which is to say that Anthropic probably doesn't have good training documents and evals to teach the model how to do that.
Well, they didn't. But now they have some.
If the author wants to improve his efficiency even more, I'd suggest he start creating tools that let a human produce a text trace of a good run at decompiling this project.
Those traces can be hosted somewhere Anthropic can see, and then, after the next model pre-training, there's a good chance the model becomes even better at this task.
Not what I would have expected from a 'one-shot'. Maybe self-supervised would be a more suitable term?
See also, "zero-shot" / "few-shot" etc.
We're essentially trying to map 'traditional' ML terminology to LLMs, it's natural that it'll take some time to get settled. I just thought that one-shot isn't an ideal name for something that might go off into an arbitrarily long loop.
1. Getting an LLM to do something based on a single example
2. Getting an LLM to achieve a goal from a single prompt with no follow-ups
I think both are equally valid.
It doesn't do it in one-shot on the GPU either. It feeds outputs back into inputs over and over. By the time you see tokens as an end-user, the clanker has already made a bunch of iterations.
> An open-source license is a type of license for computer software and other products that allows the source code, blueprint or design to be used, modified or shared (with or without modification) under defined terms and conditions.
https://en.wikipedia.org/wiki/Open_source
Companies have been really abusing what open source means: claiming something is "open source" because they share the code, while having a license that says you can't use any part of it in any way.
Similarly, if you've ever used that software then, depending on where you downloaded it from, you may have agreed not to decompile it or read the source code. Using that code is a gamble.
Relevant: https://github.com/orgs/community/discussions/82431
> When you make a creative work (which includes code), the work is under exclusive copyright by default. Unless you include a license that specifies otherwise, nobody else can copy, distribute, or modify your work without being at risk of take-downs, shake-downs, or litigation. Once the work has other contributors (each a copyright holder), “nobody” starts including you.
That's very different from the decompilation projects being discussed here, which do distribute the decompiled code.
These decompilation projects do involve some creative choices, which means the decompilation would likely be considered a derivative work, containing copyrightable elements from both the authors of the original binary and the authors of the decompilation project. This is similar to a human translation of a literary work. A derivative work does have its own copyright, but distributing a derivative work requires permission from the copyright holders of both the original and the derivative. So a decompilation project technically can set its own license, and thereby add additional restrictions, but it can't override the original license. If there is no original license, the default is that you can't distribute at all.
They had gotten surprisingly close to a complete decompilation, but then they tried to request a copy of the source code from the copyright office citing that they needed it as a result of ongoing unrelated litigation with Nintendo.
Later on this killed them in court.
I expect that for games the more important piece will be the art assets - like how the Quake game engine was open source but you still needed to buy a copy of the game in order to use the textures.
FOSS specifically means (or meant) free and open-source software; the words "free" and "software" are there for a reason,
so we don't need another distinction like "source available" that people have to learn in order to convey an already shared concept.
Yes, companies abuse their community's interest in something by bending "open source" from a legal term into a marketing term.
See my other comment: https://news.ycombinator.com/item?id=46175760
It'll either all be in the cloud, so you never run the code...
Or it'll be on a chip, in a hermetically sealed usb drive, that you plug in to your computer.
Obviously others aren’t concerned or don’t live in jurisdictions where that would be an issue.
Though that's only for actively developed software. I can imagine a great future where all retro games become source available.
First we get to tackle all of the small ideas and side projects we haven't had time to prioritize.
Then, we start taking ownership of all of the software systems that we interact with on a daily basis; hacking in modifications and reverse engineering protocols to suit our needs.
Finally our own interaction with software becomes entirely boutique: operating systems, firmware, user interfaces that we have directed ourselves to suit our individual tastes.
And it will be great for retro game preservation.
Having more integrated tools and tutorials on this would be awesome.
I stayed away from decompilation and reverse engineering, for legal reasons.
Claude is amazing. It can sometimes get stuck in a reasoning loop, but it will break away, reassess, and continue on until it finds its way.
Claude was murdered in a dark instance dungeon when it managed to defeat the dragon but ran out of lamp oil and torches to find its way out. Because of the light system it kept getting “You can’t seem to see anything in the darkness” and randomly walked into a skeleton lair.
Super fun to watch as an observer. Super terrifying that this will replace us at the office.
It's good at cleaning up decompiled code, at figuring out what functions do, at uncovering weird assembly tricks and more.
Anyway, we're reaching the point where documentation can be generated by LLMs and this is great news for developers.
I don't want invented rationales for changes, I want to know the actual reason a developer decided that the code should work that way.
Not so much when you have a lot of code from 6 years ago, built around an obscure SDK, and you have to figure out how it works, and the documentation is both incredibly sparse and in Chinese.
If people __can__ actually read undocumented code with the help of LLMs, why do you need human-written documentation really?
I then pasted this to another CC instance running the FE app, and it made the counter part.
Yes, I could have CC running against both repos and sometimes do, but I often run separate instances when tasks are complex.
Although, of course, if you don't vibe-document but instead just use them as a tool with significant human input, then yes, go ahead.
I hope that others find this similarly useful.
The hardest form of code obfuscation is homomorphic computing: code transformed to act on encrypted data exactly as regular code acts on regular data. The homomorphic code is hard obfuscated by this transformation.
Now create a homomorphic virtual machine, that operates on encrypted code over encrypted data. Very hard to understand.
Now add data encryption/decryption algorithms, themselves homomorphically encrypted and run by the virtual machine, to prepare and recover the inputs, outputs, or effects of any data or event information for the homomorphic application code. Now that all data within the system is encrypted by means which are themselves hard obfuscated, running on code which is hard obfuscated, the entire system becomes hard^2 (not a formal measure) opaque.
This isn't realistic in practice. Homomorphic implementations of even simple functions are extremely inefficient for the time being. But it is possible, and improvements in efficiency have not been exhausted.
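The core idea shows up already in a toy additively homomorphic scheme; a minimal Paillier demo with tiny, insecure parameters (real FHE schemes are far heavier, but the principle is the same: the party doing the computation never sees the plaintext):

```python
# Toy Paillier cryptosystem (tiny, INSECURE parameters, purely
# illustrative): multiplying ciphertexts corresponds to adding the
# plaintexts, so a server can sum numbers it cannot read.
import math
import random

p, q = 17, 19
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # precomputed decryption constant

def enc(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:       # r must be coprime with n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

c = (enc(42) * enc(100)) % n2  # "addition" performed on ciphertexts only
print(dec(c))                  # → 142
```

Paillier only gives you addition; fully homomorphic schemes support arbitrary computation, which is where the extreme inefficiency mentioned above comes in.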
Equivalent but different implementations of homomorphic code can obviously be made. However, given that the only credible explanation for the new code's design decisions is that it was built to exactly match the original code, this precludes any "clean room" defense.
--
Implementing software with neural network models wouldn't stop replication, but it would decompile into source that was clearly not developed independently of the original implementation.
Even distilling (training a new model on the "decompiled" model) would be a dead giveaway that it was derived directly from the source, not a clean-room implementation.
--
I have wondered whether quantum computing might enable an efficient version of homomorphic computing over classical data.
Just some wild thoughts.
If you are able to monitor what happens to encrypted data being processed by an LLM, could you not match that with the same patterns created by unencrypted data?
Real simple example, let's say I have a program that sums numbers. One sends the data to an LLM or w/e unencrypted, the other encrypted.
Wouldn't the same part of the LLM/compute machine "light up" so to speak?
Like, if it ever leaks, or you were planning on releasing it, literally every step you took in your crime is uploaded to the cloud ready to send you to prison.
It's what's stopped me from using hosted LLMs for DMCA-legal RE. All it takes is for a prosecutor/attorney to spin a narrative based on uploaded evidence and your ass is in court.
Have you tried asking them to simply open source the code?
Even if they were willing to (they're not) and if they still have the code (they don't), it will contain proprietary code from Nintendo and you'll never get your hands on that (legally)