I tried Fable vs Codex 5.5 xhigh on three different cases.
1. A resource leak with unknown cause. Both of them zoomed onto the same potential issue and proposed almost identical patches. Fable missed an edge case that Codex handled correctly.
2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.
3. An open research problem in CS, presented as a codebase with documentation and performance metrics over datasets. Both were spinning wheels. Which can certainly mean the whole approach had run its course but older models were not able to identify the previous round of improvement either.
I liked the prose coming out of Fable more: it was almost like if Obama was giving tech speeches. By actual solution metrics however they both appear in the same place, naturally with the caveat that we didn't really have more time with Fable to compare further.
I always find it amusing when people claim "a very complex implementation". Sometimes it's a hard problem, other times an easy one. Either way that's not for you to judge.
And the implementation being complex... is that a good thing? Wouldn't a simple implementation be better? It reminded me of the parable of two programmers.
What caught my eye is the complexity you assign to a project like this. It’s hairy but I wouldn’t call it super complicated. I find that super interesting to be honest because it probably means that it is really hard and I am just used to this shit now and it all looks doable to me now.
I never think of anything as “complex”, certainly not my own work and I always think what other people do is so much more impressive but I’m starting to realize it might be a me-issue.
I worked on some pretty hairy nonsense like say a DB replication solution but I still think it was just tangly, not complex like say a particle collider. Maybe I also need to call my work super complex and highly abstract. Now that I think of it I have a history of not being taken seriously while others with easy shit get credits.
Fable felt like having access to that "old Opus" again, but a little smarter. Sort of like I'd expect an Opus 5 to be. It's not earth shattering, but it was a step in the right direction. And it was distinctively so, because having to go back to Opus 4.6/4.7/4.8 has been borderline depressing...
It understood more with less help, did more per turn, and was less argumentative. It also felt a little less trite in its answers, which is an understated improvement for those who use claude code all the time
But then X starts to degrade. At first subtly, and then drastically. So then I am forced to upgrade to Y.
What I do not understand is:
> is this a sneaky way for companies to push users up the chain?
> Or is this a genuine fault in model design/resource allocation?
Yet when you do blind tests they can't tell the difference between a $1000 cable and a $1 one.
I bet if you do blind tests between GPT-5.3, 5.4 and 5.5 most would struggle to tell them apart, yet they are certain that "5.5 was nerfed 1 week after release, it's so obvious, it was John Carmack, now it can barely write a for loop"
This made me think, well, sure, if you tell them what to look for... but then:
> The models can look at the whole repo, and follow logic across file boundaries, but they’re not told what to look for.
So okay, the first one was an accidental mis-statement?
In the benchmark the models were told to look at the file and were allowed to look at the rest of the repo, with no clues about what to look for.
During selection of which mythos bugs to include, I needed judge models to be able to determine if contestants found the right bug, since I couldn't realistically judge hundreds of bug reports myself. So, they were given the bug location and told to identify and explain it.
Outside of the test, they are told “can you find this bug in this file?”
But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes. I haven't published the replication results for Gemma 4, yet, where I gave it multiple opportunities, but the dense version was consistently able to find four of the nine bugs exactly, plus two other very difficult bugs that it found occasionally, sometimes with a not quite accurate description (which gets partial credit in its own column on the big benchmark), six altogether. Leaving three of the bugs in the corpus that no model other than Mythos ever found, but also making Gemma 4 31B the best model I have results for (but it got multiple attempts, which I assume would make any of the models perform better).
So, my conclusion, not very strongly held, is: Mythos is both better than other public models and it has fewer guardrails. But, also that the guardrails in current models are probably not strict enough to prevent this work. Only Gemini models when run under Antigravity refused to perform the work. Maybe Mistral silently refused due to guardrails, I'm not sure, since it failed to find any bugs. Maybe it just sucks.
This benchmark is about finding security bugs, not writing secure code. I don't believe the models have guardrails that prevent writing safe code, but they're also not intelligent and have a bunch of insecure code in their training data, so they definitely write insecure code sometimes.
Did it "disprove" it retroactively or just changed what the situation is, given that until then they were indeed weaker at small sizes?
Anyway, I kinda think among US models only Fable really tries to block security work like this, based on my experience so far.
I suggest tasks cannot be guessed (find, not tell). And 2d charts, both for ROC and pricing, vide https://quesma.com/benchmarks/binaryaudit/
Fable just understood what I was talking about and never needed me to stop it and say "you forgot this thing we talked about." The difference in spatial reasoning capability between the three models is very very palpable. I am curious to get more time with it because ultimately I feel like I sandbagged it by giving it problems that would've been within Opus' abilities, but required a lot more handholding.
Reminds me of the old adage: don't try to be too smart when writing code. Otherwise, dumber people - including your future self - will have trouble working with it.
if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it
For reference: it's called Kernighan's Law, and can be found in the Second Edition of "The Elements of Programming Style", page 10 [1].
The original phrasing is:
> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
[1] https://archive.org/details/the-elements-of-programming-styl...
Fable's probably objectively better at full power. I mean, I definitely felt the same difference in competency between Fable and current Opus. But Opus itself has definitely been nerfed, and Fable, even if it comes back the public forever (probably won't), will get nerfed.
That was a nice time. Let us get back to that time. Use open weights models. Own stuff.
This is interesting. The "reported to me like a colleague" part.
Is it just that anthropic gave Mythos even more of that Anthropic™ character, (incorrectly) radiating confidence?
Is that why people have been losing their minds over that thing? Is this just cheap social engineering?
I mean I bet it is also slightly more capable than opus, but that would all check out to me. Man.
Thanks for sharing I suppose.
Or opus to opus
Or really any new thing to old thing
The user here is right in what they said but wrong in why they said it, essentially.
Every upgrade made what came before it appear awful in comparison, to such an extent that every upgrade was called "photorealistic" and people kept forgetting that they'd been using that description for the previous engines that they were now dismissing.
I do make mistakes though. Please check results.
to an extent that might have done it, but i had been playkng around ahead of time trying to reverse engineer my ray bans case so i can make my own plastic insert, and fable to opus' work from mostly broken to mostly done, and then when fable went away, opus broke it again
Perhaps it is a lot of small improvements all over the place, but the sum is a step change in capability.
The free will coin?
Do I have free will, or am I bounded by the laws of physics?
Even if you think my soul is completely independent of my body, there are theologians who argue that God being omniscient means that who goes to heaven and hell is predetermined before birth and therefore no action you take will ever change the afterlife you go to, and that to think God isn't omniscient would be blasphemy; do they think I have free will?
And then there's Thelma with "Do what thou wilt shall be the whole of the Law", which can be understood in terms of (amongst other things) "Don't let peer pressure manipulate you into thinking you want other things than you really want", though this is of course a simplification much as the omniscient example above: https://en.wikipedia.org/wiki/True_Will
And, false positives are reported in the results.
But, Gemini CLI is deprecated. So, I tried to use Antigravity and it simply refused.
Weirdly, Gemma 4 has proven to be excellent at this task in subsequent tests. The best in its size/class. So, not everybody at Google is determined to break Google models for security work.
A cursory reading of the model card shows Mythos/Fable is a fine tune on Project Zero with some steering on persistence.
But I think it's a valuable lesson: advertise your product as a nuclear weapon while microdosing at Lighthaven to enough Davos attendees and sooner or later? Someone is going to evaluate the claim from a chair where you act first and nuance later.
Wild that Amodei's blog and pod circuit are the greatest IPO risk.
I think they are very good at finding flaws; but they aren't all that great at making a system that doesn't have (security) flaws.
These models are definitely a lot better than your run of the mill human developer at finding security flaws in existing systems. I'm agnostic at how good they are at actually making a secure system. Probably better, too, for two reasons:
- humans are really terrible
- the model probably has an easier time picking up special purpose tools you can use to write proven secure systems
I don't think Mythos can write secure C code, either. Practically no one can. (At least not directly. See how seL4 is officially written in C; but they didn't just set out to carefully write secure C code directly; C just happens to be an intermediate language they use.)
Almost all existing real world software is full of holes and security flaws. Mythos is better than humans at uncovering many of them; especially because its time is a lot cheaper than that of the top tier human experts (and even of mid-and low-tier human experts).
Especially when these systems are written in notoriously unreliably languages like C.
I don't think Mythos is especially good at writing systems that are free of security problems. Essentially the only way we know is by proving your software correct.
In principle, you can even prove C correct, but in practice you'll want to write your system from the ground up to be proven correct instead of adding that property after the fact; and for that you'll most likely also want to pick a language that supports this better.
See https://en.wikipedia.org/wiki/SeL4 for a noteworthy example.
"I’d say this benchmark answers with a resounding, “Maybe.”
Mythos maybe really is better than the other current models at finding security bugs"
Yet in the results, I don't see Mythos?
It seems like a really well researched article with lots of results for other models, yet the title seems to be clickbait because the results don't contain Mythos, do they?
Mythos is the 100% against which the other models are compared.
Although the benchmark had 100$ budget cap and rudimentary tooling so probably a bit less than 100%.
GPT-5.5-pro attemted only 4 problems out of 9 before the budget ran out and got 2 of them right.
It's a shame that the author didn't try GPT-5.5-pro on all 9 just for completeness, pehaps on subscription to save money.
If anyone wants to fund the other five cases (~$125), I'll run them. I find that an unrealistic cost, though...simply not useful data. I'm certainly not going to spend $23 per file to audit a project with hundreds or thousands of files. I don't know anyone who would.
Also note that it was $100 cap per model, and the next most expensive model was GPT 5.5 at a 20th the price per case, about ten bucks for the whole batch.