E.g. if you open this in a browser, you’ll get the challenge: https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4...
But if you run this, you get the page content straight away:
curl https://code.ffmpeg.org/FFmpeg/FFmpeg/commit/13ce36fef98a3f4e6d8360c24d6b8434cbb8869b
I’m pretty sure this gets abused by AI scrapers a lot. If you’re running Anubis, take a moment to configure it properly, or better, put together something that’s less annoying for your visitors, like the OP.
https://foundation.wikimedia.org/wiki/Policy:Wikimedia_Found...
[0]: https://en.wikipedia.org/wiki/User-Agent_header#Format_for_h...
[1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
In practice, it hasn't been an issue for many months now, so I'm not sure why you're so sure. Disabling Anubis takes servers down; allowing curl bypass does not. What makes you assume that aggressive scrapers that don't want to identify themselves as bots will willingly identify themselves as bots in the first place?
I work with ffmpeg so I have to access their bugtracker and mailing list site sometimes. Every few days, I'm hit with the Anubis block. And 1/3 - 1/5 of the time, it fails completely. The other times, it delays me by a few seconds. Over time, this has turned me sour on the Anubis project, which was initially something I supported.
It's like airplane check-in. Are we inconvenienced? Yes. Who is there to blame? Probably not the airline or the company that provides the service. Probably the people who want to fly without a ticket or bring explosives on board.
As long as the Anubis project and the people behind it don't try to play both sides and make the LLM situation worse (mafia-racket style), I think if it works, it works.
That quote is a strong indication that he sees it this way.
Sounds like maybe it'll be fixed soon though
The math on the site linked here as a source for this claim is incorrect. The author of that site assumes that scrapers will keep track of the access tokens for a week, but most internet-wide scrapers don't do so. The whole purpose of Anubis is to be expensive for bots that repeatedly request the same site multiple times a second.
When reviewing it I noticed that the author carried the common misunderstanding that "difficulty" in proof of work is simply the number of leading zero bytes in a hash, which limits the difficulty granularity to whole bytes (each extra zero byte multiplies the expected work by 256) rather than individual bits. I realize that some of this is the cost of working in JavaScript, but the hottest code path seems to be written extremely inefficiently.
    // Hot loop: try nonces until the hash has the required number of
    // leading zero *bytes*, awaiting an async SHA-256 call on every attempt.
    for (;;) {
      const hashBuffer = await calculateSHA256(data + nonce);
      const hashArray = new Uint8Array(hashBuffer);

      let isValid = true;
      for (let i = 0; i < requiredZeroBytes; i++) {
        if (hashArray[i] !== 0) {
          isValid = false;
          break;
        }
      }
      // ... (on success the nonce is returned; otherwise it is incremented and the loop repeats)
    }
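For comparison, a difficulty check with bit-level granularity only needs a few more lines. This is an illustrative sketch, not Anubis's actual code, and requiredZeroBits is a made-up parameter:

    // Count leading zero bits instead of whole zero bytes, so difficulty
    // can be tuned in factors of 2 rather than factors of 256.
    function hasLeadingZeroBits(hashArray, requiredZeroBits) {
      let remaining = requiredZeroBits;
      for (let i = 0; i < hashArray.length && remaining > 0; i++) {
        if (remaining >= 8) {
          if (hashArray[i] !== 0) return false;  // whole byte must be zero
          remaining -= 8;
        } else {
          // Only the top `remaining` bits of this byte must be zero.
          return (hashArray[i] >> (8 - remaining)) === 0;
        }
      }
      return true;
    }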
It wouldn’t be an exaggeration to say that a native implementation of this with even a hair of optimization could reduce the “proof of work” to being less time-intensive than the SSL handshake.

Proof of work can't function as a counter-abuse challenge even if you assume that the attackers have no advantage over the legitimate users (e.g. both are running exactly the same JS implementation of the challenge). The economics just can't work. The core problem is that the attackers pay in CPU time, which is fungible and incredibly cheap, while the real users pay in user-observable latency, which is hellishly expensive.
Specifically for Firefox [1], they switch to the JavaScript fallback because that's actually faster [2] (probably because of WebCrypto's per-call overhead):
> One of the biggest sources of lag in Firefox has been eliminated: the use of WebCrypto. Now whenever Anubis detects the client is using Firefox (or Pale Moon), it will swap over to a pure-JS implementation of SHA-256 for speed.
[0] https://developer.mozilla.org/en-US/docs/Web/API/SubtleCrypt...
[1] https://github.com/TecharoHQ/anubis/blob/main/web/js/algorit...
[2] https://github.com/TecharoHQ/anubis/releases/tag/v1.22.0
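For what it's worth, that per-call overhead is easy to eyeball with a micro-benchmark along these lines (illustrative sketch only, assumes a browser context with top-level await such as the devtools console; not Anubis code):

    // Time 10k tiny digests through WebCrypto. Each call is an async round
    // trip, so fixed per-call overhead dominates when the input is small.
    const enc = new TextEncoder();
    const t0 = performance.now();
    for (let i = 0; i < 10000; i++) {
      await crypto.subtle.digest("SHA-256", enc.encode("challenge" + i));
    }
    console.log(`${(performance.now() - t0).toFixed(0)} ms for 10k WebCrypto hashes`);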
Why is this inefficient?
OpenAI Atlas defeats all of this by being a user's web browser. They got between you and the user you're trying to serve content to, and they slurp up everything the user browses to send it back for training.
The firewall is now moot.
The bigger AI company, Google, has already been doing this for decades. They were the middlemen between your reader and you, and that position is unassailable. Without them, you don't have readers.
At this point, the only people you're keeping out with LLM firewalls are the smaller players, which further entrenches the leaders.
OpenAI and Google want you to block everybody else.
Do you have any proof, or even circumstantial evidence to point to this being the case?
If Chrome actually scraped every site you ever visited and sent it off to Google, it'd be trivially simple to find some indication of that in network traffic, or heck, even in the Chromium code.
Who would dare block Google Search from indexing their site?
The relationship is adversarial, but necessary.
But for anyone whose main concern is their server staying up, Atlas isn't a problem. It's not doing a million extra loads.
Would you trust OpenAI if they told you it doesn't?
If you would, would you also trust Meta to tell you if its multibillion dollar investment was trained on terabytes of pirated media the company downloaded over BitTorrent?
Personally, I would just believe what they say for the time being; there would be backlash in doing something else, possibly a legal one.
That isn't a conspiracy theory, it's fundamentally how interfacing with 3rd party hosted LLMs works.
Unless the user asked something that just requires visiting many pages, I suppose. For example, Google Gemini was pretty helpful in finding out the typical price ranges and dishes at a local shopping centre's coffee shops, as the information was far from being in just a single page.
It's definitely pointless if you completely miss the point of it.
> OpenAI Atlas defeats all of this by being a user's web browser. They got between you and the user you're trying to serve content, and they slurp up everything the user browses to return it back for training.
Cool. Anubis' fundamental purpose is not to prevent all bot access tho, as clearly spelled out in its overview:
> This program is designed to help protect the small internet from the endless storm of requests that flood in from AI companies.
OpenAI Atlas piggybacking on the user's normal browsing is not within the remit of Anubis, because it's not going to take a small site down or dramatically increase hosting costs.
> At this point, the only people you're keeping out with LLM firewalls are the smaller players
Oh no, who will think of the small assholes?
The point of the article is that if the scraper is sufficiently motivated, Anubis is not going to do much anyway, and if the scraper doesn't care, the same result can be achieved without annoying your actual users.
Am I missing something here? All this does is set an unencrypted cookie and reload the page, right?
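If that's all it is, the whole gate fits in a few lines. A minimal sketch of the idea (hypothetical Node.js example, not the OP's actual code; the js_check cookie name is made up):

    const http = require("http");

    http.createServer((req, res) => {
      const cookies = req.headers.cookie || "";
      if (cookies.includes("js_check=1")) {
        // Cookie present: serve the real content.
        res.writeHead(200, { "Content-Type": "text/html" });
        res.end("<p>Actual page content</p>");
      } else {
        // No cookie yet: serve a stub that sets one via JS and reloads.
        res.writeHead(200, { "Content-Type": "text/html" });
        res.end('<script>document.cookie = "js_check=1; path=/"; location.reload();</script>');
      }
    }).listen(8080);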
Anubis isn’t some conspiracy to show you pictures of anime catgirls, it’s a desperate attempt to stave off bot-driven downtime. Many admins who install it do so reluctantly, because obviously it is annoying to have a delay when you access a website. Nobody is doing that for fun.
(There are probably a few people who install it not to protect against scraper DDoS, but due to ideological opposition to AI scrapers. IMHO this is fruitless, as the more intelligent scrapers will find ways around it without calling attention to themselves. Anubis makes almost no sense on a static personal blog.)
I can't fully articulate it, but I feel like there is some game-theory aspect of the current design that's just not compatible with reality.
I have a personal website that sometimes doesn't get an update for a year. Still, bots make up the majority of visitors. (Not so much that I would need countermeasures, but still.) Most bot visits could be avoided with such a scheme.
With footnote:
"I don’t know if they have any good competition, but “Cloudflare” here refers to all similar bot protection services."
That's the crux. Cloudflare is the default; no one seems to bother taking the risk with a competitor, for some reason. Competitors seem to exist, but when asked, people can't even name them.
(For what it's worth I've been using AWS Cloudfront but I had to think a moment to remember its name.)
Admittedly, this is no different from the ways Anubis is hostile to those same users; truly a tragedy of the commons.
In the ongoing arms race, we're likely to see simple things like this sort of check result in a handful of detection systems that look for “set a cookie” or at least “open the page in headless Chrome and measure the cookies.”
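That countermeasure is already commodity tooling; a hedged sketch of what it amounts to (assumes Puppeteer, with example.com standing in for the protected site):

    const puppeteer = require("puppeteer");

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      // Load the page once in a real (headless) Chrome; whatever challenge
      // cookie it sets is captured here and can be replayed over plain HTTP.
      await page.goto("https://example.com/", { waitUntil: "networkidle2" });
      console.log(await page.cookies());
      await browser.close();
    })();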
Does anyone have any proof of this?
I mean, they have access to a mind-blowing amount of computing resources, so if using a fraction of that improves the quality of the data (and they hold the convenient belief that scale is everything), why not run JS too? Heck, if they have to run a full browser in a container, not even headless, they will.
Navigate, screenshot, etc.; it has something like 30 tools in it alone.
Now we can just run real browsers with LLMs attached. Idk how you even think about defeating that.
Why doesn't one company do it and then resell the data? Is it a legal/liability issue? If you scrape, it's a legal grey area, but if you sell what you scrape, it's clearly copyright infringement?
E: In fact, this whole idea is so stupid that I am forced to consider whether it is just a DDoS in the original sense: scrape everything so hard it goes down, just so that your competitors can't.
> Yeah, but only because the LLM bots simply don’t run JavaScript.
I don't think that this is the case, because when Anubis itself switched from a proof-of-work to a different JavaScript-based challenge, my server got overloaded, but switching back to the PoW solution fixed it [0].
I also semi-hate Anubis since it required me to add JS to a website that used none before, but (1) it's the only thing that stopped the bot problem for me, (2) it's really easy to deploy, and (3) very few human visitors are incorrectly blocked by it (unlike Captchas or IP/ASN bans that have really high false-positive rates).
It's kind of a self-fulfilling prophecy: you make the visitor experience worse, which gives a ready-made justification for why having an LLM serve the content is wanted and needed.
All of that because, in the current lambda/cloud computing world, it has become very expensive to process only a few requests.
A web forum I read regularly has been playing whack-a-mole with LLM scrapers for much of this year, with multiple weeks-long periods where the swarm-of-locusts would make the site inaccessible to actual users.
The admins tried all manner of blocks, including ultimately banning entire countries' IP ranges, all to no avail.
The forum's continued existence depends on being able to hold off abusive crawlers. Having to see half-a-second of the Anubis splashscreen occasionally is a small price to pay for keeping it alive.
> have to face a 3s stupid nagscreens like the one of anubis, I'm very pissed off and pushed even more to bypass the website when possible to get the info I want directly from llm or search engine.
Most (freely accessible) LLMs will take more than 3s to "think". Why are you pissed off about Anubis, but not the slow LLM? And then you have to double check the LLM anyway...
> All of that because in the current lambda/cloud computing word, it became very expensive to process only a few requests.
You're making some very arrogant assumptions here. FOSS repos and bugtrackers are generally not lambda/cloud hosted.
One thing I noticed, though, was that the Digital Ocean Marketplace image asks if you want to install something called CrowdSec, described as a "multiplayer firewall". While it is a paid service, there appears to be a community offering that is well-liked enough. I was really wondering what downsides it has (except for the obvious one, which is that you are definitely trading some user privacy in service of security), but at least in principle the idea seems like a nice middle ground between Cloudflare and nothing, if it works and the business model holds up.
What I realised recently is that, for non-user browsers, my demos are effectively zip bombs.
Why?
Because I stream each frame, and each frame is around 180 kB uncompressed (compressed frames can be as small as 13 bytes). This is fine, as the user's browser doesn't hold onto the frames.
But a crawler will hold onto those frames. Very quickly this ends up being a very bad time for them.
Of course there's nothing of value to scrape, so it's mostly pointless. But I found it entertaining that some scummy crawler is getting nuked by checkboxes [1].
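The effect described is easy to picture with a toy endpoint that streams frames forever: a real browser renders and discards each chunk, while a crawler that buffers the whole response balloons in memory (hypothetical sketch, not the actual demo code):

    const http = require("http");

    http.createServer((req, res) => {
      res.writeHead(200, { "Content-Type": "application/octet-stream" });
      const frame = Buffer.alloc(180 * 1024);                 // ~180 kB per frame
      const timer = setInterval(() => res.write(frame), 33);  // ~30 frames/s
      req.on("close", () => clearInterval(timer));             // stop when the client leaves
    }).listen(8080);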
Work functions make sense in password hashes because they exploit an asymmetry: attackers will guess millions of invalid passwords for every validated guess, so the attacker bears most (really almost all) of the cost.
Work functions make sense in antispam systems for the same reason: spam "attacks" rely on the cost of an attempt being so low that it's efficient to target millions of victims in the expectation of just one hit.
Work functions make sense in Bitcoin because they function as a synchronization mechanism. There's nothing actually valorous about solving a SHA2 puzzle, but the puzzles give the whole protocol a clock.
Work functions don't make sense as a token tax; there's actually the opposite of the antispam asymmetry there. Every bot request to a web page yields tokens to the AI company. Legitimate users, who far outnumber the bots, are actually paying more of a cost.
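A back-of-the-envelope sketch of that asymmetry, with purely illustrative numbers (assuming 1 s of CPU per challenge and roughly $0.04/hour for a cloud vCPU):

    const challengeSeconds = 1;        // assumed CPU time to solve one challenge
    const cpuDollarsPerHour = 0.04;    // assumed price of a commodity cloud vCPU
    const pagesScraped = 1_000_000;

    const attackerDollars = pagesScraped * challengeSeconds / 3600 * cpuDollarsPerHour;
    console.log(`Attacker pays ~$${attackerDollars.toFixed(2)} to fetch ${pagesScraped} pages.`);
    // Every legitimate visitor pays the same ~1 s, but in human wall-clock
    // latency, which is worth far more than a fraction of a cent of CPU time.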
None of this is to say that a serious anti-scraping firewall can't be built! I'm fond of pointing to how Youtube addressed this very similar problem, with a content protection system built in Javascript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new Youtube account was using.
The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.
Agreed, residential proxies are far more expensive than compute, yet the bots seem to have no problem obtaining millions of residential IPs. So I'm not really sure why Anubis works—my best guess is that the bots have some sort of time limit for each page, and they haven't bothered to increase it for pages that use Anubis.
> with a content protection system built in Javascript that was deliberately expensive to reverse engineer and which could surreptitiously probe the precise browser configuration a request to create a new Youtube account was using.
> The next thing Anubis builds should be that, and when they do that, they should chuck the proof of work thing.
They did [0], but it doesn't work [1]. Of course, the Anubis implementation is much simpler than YouTube's, but (1) Anubis doesn't have dozens of employees who can test hundreds of browser/OS/version combinations to make sure that it doesn't inadvertently block human users, and (2) it's much trickier to design an open-source program that resists reverse-engineering than a closed-source program, and I wouldn't want to use Anubis if it went closed-source.
[0]: https://anubis.techaro.lol/docs/admin/configuration/challeng...
Either way: what Anubis does now --- just from a CS perspective, that's all --- doesn't make sense.
So the “PoW tax” essentially only applies to low-volume requesters who have no incentive to optimize, or to bespoke setups too diverse to optimize at scale.
https://yumechi.jp/en/blog/2025/proof-of-mutex-outspeeding-a...
https://github.com/eternal-flame-AD/pow-buster
The problem was "fixed" but then reverted because the fix had a deadlock bug. (Changelog entry: "Remove bbolt actorify implementation due to causing production issues.")
We're somehow still stuck with CAPTCHAs (and other challenges), a 25-year-old concept that wastes millions of human hours and billions in infra costs [0].
It's time to start building our own walled gardens: overlay VPN networks for humans. Put services there. Someone misbehaves? Ban their IP. They come back? Ban again. Back yet again? Fine, ban the whole VPN provider. Just clean up the mess. Different networks can peer and exchange traffic. Look, the Internet is just a network of networks; it's not that hard.
Cool... but I guess now we need a benchmark for such solutions. I don't know the author, and I roughly know the problem (as I self-host and most of my traffic now comes from AI scraper bots, not the usual indexing bots or, mind you, humans), but when there are numerous solutions to a multi-dimensional problem I need a common way to compare them.
Yet another solution is always welcome, but without being able to compare them efficiently it doesn't help me pick the right one for me.
It is a shitty and obviously bad solution for preventing scraping traffic. The goal of scraping traffic isn't to overwhelm your site, it's to read it once. If you make it prohibitively expensive to read your site even once, nobody comes to it. If you make it only mildly expensive, nobody scraping cares.
Anubis is specifically DDOS protection, not generally anti-bot, aside from defeating basic bots that don't emulate a full browser. It's been cargo-culted in front of a bunch of websites because of the latter, but it was obviously not going to work for long.
If the authors of the scrapers actually cared, we wouldn't have this problem in the first place. But today the more appropriate description is: the goal is to scrape as much data as possible, as quickly as possible, preferably before your site falls over. They really don't care about any side effects beyond that. Search engines have an incentive to leave your site running. AI companies don't. (Maybe apart from Perplexity.)