Project Glasswing: An Initial Update
95 points | 1 hour ago | 14 comments | anthropic.com | HN
mdeeks
31 minutes ago
[-]
You can get a taste of this today yourself with Codex Security. I turned it on just as an experiment and in less than a week it has now become essential to all of us. I was shocked how accurate it is, how many security issues it found in existing code, how it continually finds them as we commit, and how NO ONE is immune from making these mistakes.

I'd say it is about 90% accurate for us. Often even the "Low" findings lead us to dig and realize it is actually exploitable. Everyone makes these mistakes, from the most junior to the most senior. They are just a class of bugs after all.

I expect tools like this to be a regular part of the development lifecycle from here on. We code with AI, we review with AI, we search for vulns with AI. Even if it isn't perfect, it is easily worth the cost IMHO. Highly recommend you get something enabled for your own repos ASAP

reply
winstonwinston
9 minutes ago
[-]
> I expect tools like this to be a regular part of the development lifecycle from here on. We code with AI, we review with AI, we search for vulns with AI. Even if it isn't perfect, it is easily worth the cost IMHO.

So, how is that supposed to work? Claude Code generates security bugs, then Claude Security finds them, then Claude Code generates a fix, spend tokens, profit?

reply
siva7
1 minute ago
[-]
So? That's how a business works. We sold you landmines and now you need them removed? Lucky you we also have mine clearance products.
reply
jimmy2times
3 minutes ago
[-]
The AIs have already figured out how to succeed in a software job:

1. Ship bugs

2. Fix them

3. You're the hero!

reply
Version467
18 minutes ago
[-]
I’ve had the same experience. The ui is a little unclear about this, because it says you have 5 scans, but 1 scan is just the continuous monitoring of the default branch of a repo.

The high impact findings have almost all been bang on for me. I was especially surprised by the high-quality documentation it produces as well as how narrow the proposed fixes are.

I’m used to codex producing quite a bit more code than it needs to, but the security model proposed fixes that are frequently <10 loc, targeting exactly the correct place.

It’s really quite good. I’m assuming it’ll be pretty expensive once out of beta, but as a business I’d be jumping on this.

reply
rmast
4 minutes ago
[-]
I help maintain a project that is used as a dependency by a lot of security tools to handle PE files.

It’s disappointing that Anthropic and OpenAI never responded to the applications to their respective programs for open source maintainers. From my perspective it seems like their offers are primarily for the shiny well-known projects, rather than ones that get only a few million monthly installs but aren’t able to get thousands of stars due to being “hidden” as a dependency of a popular tool.

reply
0xAstro
19 minutes ago
[-]
I would recommend you to try out the setup with gpt-5.5-cyber as the orchestrator and deepseek-v4-flash or some other fast cheap model as its workers. Getting pretty good results using this setup.
reply
nikcub
10 minutes ago
[-]
There has been a lot of cynicism around mythos, that it's just the usual public models without guardrails, etc. etc. but this:

> 1,752 of those high- or critical-rated vulnerabilities have now been carefully assessed by one of six independent security research firms, or in a small number of cases by ourselves. Of these, 90.6% (1,587) have proved to be valid true positives, and 62.4% (1,094) were confirmed as either high- or critical-severity.

for anybody who has applied opus, codex or oss models for vuln scanning - the true positive rate and discovery volume are a clear step change[0]. The ~50 partners in Glasswing have largely all previously run harnesses with other models and many of them have come out and said - essentially - "ye, wow"

Question now is what the second and third phases of access look like - deciding which class of systems to secure. Routers, firewalls, SaaS, ERP systems, factory controllers, SCADA systems, zero-trust VPN gateways, telecoms gear and networks, medical devices - there's just so much to do

This is why I believe mythos will remain private for the foreseeable future. There's such a large surface that needs to be secured and so much to triage, fix, deploy.

That may suit Anthropic as private models can't be distilled. There's also a runaway effect of model improvement from the discovery, triage and fix data. This is likely already the most potent corpus of curated offensive data ever assembled and will only get better.

I don't see how Chinese companies are given access soon, or ever. We're likely going to see a world soon of CISA mandated audits, and where to buy a mythos-proof VPN gateway or home router - you'll have to buy American[1].

[0] vs ~30% or so in regular audit tools

[1] or allied

reply
0xAstro
37 minutes ago
[-]
I had a fun day today where I had deepseek-v4-flash subagents work out a patch for dirty frag for systems with AF_ALG disabled and nscd turned on, to gain root access. The original published exploit wasn't working but the patched one worked like a charm.

I am still a believer that 100 subagents with good-enough intelligence can get the same results as mythos. I am ready for this opinion to be shattered when I eventually try mythos, and I believe others here must have tried mythos out too.

reply
lukeschlather
17 minutes ago
[-]
That's probably true, but when you're talking about 100 subagents you're talking about something that costs $100/hour to run, and Mythos takes $20k to find a vulnerability, so the question isn't "can dumber models conceivably do this?" It's, if running inference with Mythos to find an exploit costs 5000 GPU-hours per exploit, how many GPU-hours does it cost with a dumber model?
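To put rough numbers on that comparison, here is a minimal sketch. It takes the two figures above at face value ($100/hour for the subagent swarm, $20k per Mythos finding), both of which are unverified estimates from this thread, and the function name and parameters are hypothetical:

```python
def break_even_hours(cost_per_finding_usd: float, swarm_cost_per_hour_usd: float) -> float:
    """Hours a cheaper swarm can run on one finding before it costs as much
    as the more expensive model's per-finding price."""
    if swarm_cost_per_hour_usd <= 0:
        raise ValueError("swarm cost per hour must be positive")
    return cost_per_finding_usd / swarm_cost_per_hour_usd


# Figures quoted in this thread; assumptions, not measured data.
hours = break_even_hours(cost_per_finding_usd=20_000, swarm_cost_per_hour_usd=100)
print(hours)  # 200.0
```

On those assumptions the swarm only comes out cheaper if it finds a comparable vulnerability in under 200 hours of wall-clock time.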
reply
giancarlostoro
10 minutes ago
[-]
> Since then, we and our approximately 50 partners have used Claude Mythos Preview to find more than ten thousand high- or critical-severity vulnerabilities across the most systemically important software in the world. Progress on software security used to be limited by how quickly we could find new vulnerabilities. Now it’s limited by how quickly we can verify, disclose, and patch the large numbers of vulnerabilities found by AI.

I guess they forgot to scan Visual Studio Code plugins and their endless npm dependencies.

reply
pixl97
7 minutes ago
[-]
I mean that's really a different issue.
reply
rsync
12 minutes ago
[-]
I asked in a different thread:

Do we have a sense that projects like OpenBSD/OpenSSH, FreeBSD, ISC[1] and Apache were included in the "blessed" initial participants in Project Glasswing ?

Or is it big name tech companies, banks and fashionable languages and package managers ?

[1] Bind, DHCP

reply
chopete3
11 minutes ago
[-]
>> Next, we will work with critical partners—including US and allied governments—to expand Project Glasswing to additional partners.

That means, they intend to make a load of money before a general release. It is a good strategy.

reply
OsrsNeedsf2P
1 hour ago
[-]
The vulnerabilities found continue to impress, and make legacy media, Twitter and YouTube go nuts. But we still have no data to prove this wasn't doable with the same initiative backed by Opus 4.7, and there is no GA for Mythos access.
reply
krisbolton
47 minutes ago
[-]
There is independent research out there on frontier model security capability. AI Security Institute (UK) put out their paper comparing Mythos to other frontier models in early April. They've been tracking frontier model security capability since early 2023, so it's a decent dataset. https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos...
reply
energy123
1 hour ago
[-]
> Mozilla found and fixed 271 vulnerabilities in Firefox 150 while testing Mythos Preview—over ten times more than they found in Firefox 148 with Claude Opus 4.6;
reply
kllrnohj
37 minutes ago
[-]
No, not really. Mythos found 3 CVEs, not 271.

https://www.flyingpenguin.com/mythos-mystery-in-mozilla-numb...

reply
simonw
25 minutes ago
[-]
The Mozilla team responded to that argument here: https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin... - in the FAQ.
reply
moyix
29 minutes ago
[-]
I think you're confusing CVEs and vulnerabilities here? Mozilla (per their longstanding practice) grouped multiple vulnerabilities found internally under a small number of CVEs.
reply
applfanboysbgon
48 minutes ago
[-]
Did they allocate the same number of tokens to looking with Claude 4.6? Or did they find more because they looked more, owing to a special initiative by Anthropic?
reply
properbrew
40 minutes ago
[-]
> over ten times more than they found in Firefox 148 with Claude Opus 4.6

And how much with Opus 4.7? 5x?

reply
parker-3461
1 hour ago
[-]
Makes me wonder if Anthropic is really having issues with allocating compute (see recent deals with xAI and SpaceX). From available benchmarks, it seems like similar results should be possible with GPT 5.5 Pro or Opus 4.7 (with specific cybersecurity trained models).
reply
smoe
52 minutes ago
[-]
At least according to this, GPT-5.5 Cyber is on par with Mythos, as the only two models that were able to finish their 32-step corporate network attack simulation.

https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...

reply
wiwiwq
55 minutes ago
[-]
Who knows, but from a valuation standpoint it’s better to signal that demand is higher than existing capacity.
reply
bobbycastorama
1 hour ago
[-]
I've seen a blog post by a security researcher saying that he was able to find the same vulnerabilities (for Firefox IIRC) with a ~30B params LLM...

So yeah, huge marketing as always.

reply
nikcub
6 minutes ago
[-]
Finding the needle is easier when you remove the haystack

Or providing a map with a direction

There is a long history of high-value private vulns being rediscovered from scant details

reply
simonw
20 minutes ago
[-]
You mean this one? https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag...

That's the one that says:

> We took the specific vulnerabilities Anthropic showcases in their announcement, isolated the relevant code, and ran them through small, cheap, open-weights models. Those models recovered much of the same analysis.

reply
krisbolton
41 minutes ago
[-]
This is different though, right? He found one (? we don't know who you're referring to - post sources for a higher quality discussion) vulnerability, he already knew it was there, etc. Anthropic didn't claim no other model can find vulnerabilities, nor that it's impossible with smaller models. They're claiming Mythos is a step-change in ability for end-to-end vulnerability discovery and exploit creation. And that other frontier models are close behind.
reply
Brystephor
52 minutes ago
[-]
Did the security researcher point the LLM at the blob of information and say "Find vulnerabilities" or was the LLM told to "determine if vulnerability X is present in this blob"? Confirmation of suspected vulnerabilities is a different problem from finding vulnerabilities.
reply
wiwiwq
58 minutes ago
[-]
To me it’s clear what’s going on.

The American firms are focused on marketing now to convince people to not even consider open sourced models / open weight models as they are inferior (that’s what they want you to believe).

reply
rhubarbtree
56 minutes ago
[-]
IPO is coming is what is going on
reply
wiwiwq
54 minutes ago
[-]
That’s implicit in my post.

If people actually believe the narrative then the bankers will over price Anthropic and get away with it.

reply
pertymcpert
1 hour ago
[-]
> Mozilla found and fixed 271 vulnerabilities in Firefox 150 while testing Mythos Preview—over ten times more than they found in Firefox 148 with Claude Opus 4.6

4.6 but close.

reply
OsrsNeedsf2P
59 minutes ago
[-]
Right, but were they using the same methodology and harness? I suspect they're doing something different with the harness - e.g. with Mythos they pass each file in one at a time, whereas on 4.6 they let Claude Code run loose to find bugs. This would have a larger impact than the model itself.
reply
boston_clone
1 hour ago
[-]
you would likely be quite interested in the more quantitative writeup from a real research team! it’s linked about midway into the article - similar functionality can be reached, yes, but not always and never with fewer tokens than what mythos requires.

https://xbow.com/blog/mythos-offensive-security-xbow-evaluat...

reply
OsrsNeedsf2P
57 minutes ago
[-]
Ok this is actually a pretty good article and justifies the step function marketing in security they talked about
reply
enlightenedfool
51 minutes ago
[-]
Is this the God model that no one else can build? Unbelievable.
reply
arjie
34 minutes ago
[-]
The era where you could reputably believe things published by anyone on this front is over. If you want this information, you’re going to have to attempt it yourself with the Opus API. It is entirely possible that any released model access will be heavily guardrailed against hacking attempts and Mythos is just an unrailed model. It is entirely possible that Mythos is a different architecture or size. We can’t know from the outside.

There is also a pretty big risk that anyone who is not you would leak the answer to the test. We are close to n=1 epistemics here. You’re going to have to do the research yourself.

reply
InsideOutSanta
44 minutes ago
[-]
I wonder if it coincidentally becomes safe to release when compute capacity bought from SpaceX will provide enough headroom to let a lot more people run it.
reply
sigmar
10 minutes ago
[-]
"available to qualifying customers’ security teams on request." Seems they're already expanding access.
reply
lukeschlather
19 minutes ago
[-]
It seems like Mythos is often (or typically?) costing $20k per vulnerability, so I don't think there will be enough compute capacity in the world any time soon to let a lot more people use it the way Glasswing is using it. That is not to say I think they are exaggerating its capabilities. That $20k is presumably the rough cost of renting the GPUs, and there are not enough GPUs in the world.
reply
why_only_15
14 minutes ago
[-]
what's the origin of your $20k/vuln estimate?
reply
b65e8bee43c2ed0
15 minutes ago
[-]
stop noticing things, chud.
reply
mlazos
12 minutes ago
[-]
I believe them to some degree but this trend of posting stuff when it can’t be verified actually needs to end. I’m so tired of this bs marketing.
reply
antirez
9 minutes ago
[-]
I have the feeling posts like that should be 1/4 the size, at max. At this point I don't care if it is AI-slop or human-slop: they are surprisingly alike. Information must be more dense, each sentence must carry some truth.
reply
vincefutr23
20 minutes ago
[-]
Mythos couldn’t find the “tens thousand” typo in this post?
reply
orangebread
31 minutes ago
[-]
BOOO RELEASE THE MODEL ALREADY GAWD
reply
guluarte
12 minutes ago
[-]
after IPO
reply
ares623
17 minutes ago
[-]
> good lord what is happening in there?!

> that's just thousands of vulnerabilities being discovered by our trillion parameter model

> thousands of vulnerabilities and trillions of parameters?! At current energy prices, in this economic climate, isolated entirely within your datacenter?

> yes

> may we see it?

> no

reply
amusingimpala75
1 hour ago
[-]
[edit: TFA addresses this, though I still find the 90% accuracy overall vs 20% accuracy for curl crazy]

Is this suspected vulns or actual vulns? If I recall correctly, it produced 5 for curl but only 1 was legit

reply
Smaug123
1 hour ago
[-]
> So far, Mythos Preview has found what it estimates are 6,202 high- or critical-severity vulnerabilities in these projects (out of 23,019 in total, including those it estimates as medium- or low-severity).

> 1,752 of those high- or critical-rated vulnerabilities have now been carefully assessed by one of six independent security research firms, or in a small number of cases by ourselves. Of these, 90.6% (1,587) have proved to be valid true positives, and 62.4% (1,094) were confirmed as either high- or critical-severity. That means that even if Mythos Preview finds no further vulnerabilities, at our current post-triage true-positive rates, it’s on track to have surfaced nearly 3,900 high- or critical-severity vulnerabilities in open-source code
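
The percentages and the projection in that quote can be reproduced from the quoted counts. A quick check, assuming the "nearly 3,900" figure comes from applying the 62.4% confirmed rate to all 6,202 estimated high- or critical-severity findings (the variable names are mine, not the article's):

```python
# Counts as quoted from the article above.
assessed = 1_752                    # findings reviewed so far
true_positives = 1_587              # confirmed as valid
confirmed_high_or_critical = 1_094  # confirmed at high/critical severity
estimated_high_or_critical = 6_202  # Mythos's own high/critical estimate

tp_rate = true_positives / assessed
confirmed_rate = confirmed_high_or_critical / assessed

# Assumption: the projection scales the confirmed rate to every estimated finding.
projected = estimated_high_or_critical * confirmed_rate

print(f"{tp_rate:.1%}")         # 90.6%
print(f"{confirmed_rate:.1%}")  # 62.4%
print(round(projected))         # 3873
```

That lands at roughly 3,870, consistent with the article's "nearly 3,900".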

reply
extr
44 minutes ago
[-]
Did you RTFA?
reply
rbranson
53 minutes ago
[-]
I don't know why you're getting downvoted. This is exactly what was reported by curl's creator under the section "Five findings became one": https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-v...
reply
the_mitsuhiko
1 minute ago
[-]
And yet [1]:

> Not even half-way through this #curl release cycle we are already at 11 confirmed vulnerabilities - and there are three left in the queue to assess and new reports keep arriving at a pace of more than one/day.

> 11 CVEs announced in a single release is our record from 2016 after the first-ever security audit (by Cure 53).

> This is the most intense period in #curl that I can remember ever been through.

[1]: https://www.linkedin.com/feed/update/urn:li:activity:7463481...

reply
Smaug123
48 minutes ago
[-]
I think it's more that the requested information is prominently featured in the article, and indeed is the content of the only graphic in the article below the intro banner.
reply
RamRodification
1 hour ago
[-]
This is marketing. So probably suspected. Or somewhere in between.
reply