AI crawlers, fetchers are blowing up websites; Meta, OpenAI are worst offenders
223 points
by rntn
1 day ago
| 25 comments
| theregister.com
pjc50
1 day ago
[-]
Place alongside https://news.ycombinator.com/item?id=44962529 "Why are anime catgirls blocking my access to the Linux kernel?". This is why.

AI is going to damage society not in fancy sci-fi ways but by centralizing profit made at the expense of everyone else on the internet, who is then forced to erect boundaries to protect themselves, worsening the experience for the rest of the public. Who also have to pay higher electricity bills, because keeping humans warm is not as profitable as a machine which directly converts electricity into stock price rises.

reply
rnhmjoj
1 day ago
[-]
I'm as far from being an AI enthusiast as anyone can be, but this issue has nothing to do with AI specifically. It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the established conventions (respecting robots.txt, using a proper UA string, rate limiting, whatever). This situation could easily have happened before the AI boom, for different reasons.
reply
Fomite
1 day ago
[-]
I'd argue it's part of the baked in, fundamental disrespect AI firms have for literally everyone else.
reply
mostlysimilar
1 day ago
[-]
But it didn't, and it's happening now, because of AI.
reply
kjkjadksj
1 day ago
[-]
People have been complaining about these crawlers for years well before AI
reply
PaulDavisThe1st
1 day ago
[-]
The issue is 1 to 4 orders of magnitude worse than it was just a couple of years ago. This is not "crawlers suck". This is "crawlers are overwhelming us and almost impossible to fully block". It really isn't the same thing.
reply
tadfisher
1 day ago
[-]
Tragedy of the commons. Before, it was cryptominers eating up all free sources of compute [0]. Now it's AI crawlers eating up all available bandwidth and server resources [1]. Reading SourceHut's struggles against the Once-lers of the world makes me want to introduce a new application layer protocol where consumers pay for abusing shared resources. Which sucks, because the Internet should remain free.

[0]: https://drewdevault.com/2021/04/26/Cryptocurrency-is-a-disas...

[1]: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

reply
PaulDavisThe1st
1 day ago
[-]
> Tragedy of the commons.

No, because there is no such thing, at least not as understood by Garrett Hardin, who put forward the phrase.

Commons fail when selfish, greedy people subvert or destroy the governance structures that help control them. If those governance structures exist (and they do for all historical commons) and continue to exist, the commons suffers no tragedy.

This recent slide deck talks about Ostrom's ideas on this, which even Hardin eventually conceded were correct, acknowledging that his diagnosis of a "tragedy of the commons" does not actually describe the historical processes by which commons are abused.

https://dougwebb.site/slides/commons

That said ... arguably there is a problem here with a "commons" that does in fact lack any real governance structure.

reply
erlend_sh
15 hours ago
[-]
No idea why this is getting downvoted; this is a very important correction since the “tragedy of the commons” meme is based on a flawed premise that needs to be amended.
reply
p3rls
17 hours ago
[-]
I am getting almost 500,000 AI scraper requests a day according to Cloudflare's AI audit. Google requests the same pages 10+ times an hour each. It was never this bad before.
reply
majkinetor
1 day ago
[-]
Obeying robots.txt cannot be enforced. Even if one country makes laws about it, another one will have zero fucks to give.
reply
spinningslate
1 day ago
[-]
It was never intended to be "enforced":

> The standard, developed in 1994, relies on voluntary compliance [0]

It was conceived in a world with an expectation of collectively respectful behaviour: specifically that search crawlers could swamp "average Joe's" site but shouldn't.

We're in a different world now but companies still have a choice. Some do still respect it... and then there's Meta, OpenAI and such. Communities only work when people are willing to respect community rules, not have compliance imposed on them.

It then becomes an arms race: a reasonable response from average Joe is "well, OK, I'll allow anyone but [Meta|OpenAI|...] to access my site." Fine in theory, difficult in practice:

1. Block IP addresses for the offending bots --> bots run from obfuscated addresses

2. Block the bot user agent --> bots lie about UA.

...and so on.

[0]: https://en.wikipedia.org/wiki/Robots.txt

reply
majkinetor
1 day ago
[-]
Thanks for the info. However, people seem to think that robots.txt will protect them, while it was created for another world, as you nicely stated. I guess Nepenthes-like tools will be more common in the future, now that the tragedy of the commons has entered the digital domain.
reply
sznio
11 hours ago
[-]
I strongly believe that AI companies are running a DDOS attack on the open web. Making websites go down aligns with their interests: it removes training data that competitors could use, and it removes sources for humans to browse, making us even more reliant on chatbots to find anything.

If it was crap coding, then the bots wouldn't have so many mechanisms to circumvent blocks. Once you block the OpenAI IP ranges, they start using residential proxies. Once you block their UA strings, they start impersonating other crawlers or browsers.

reply
1vuio0pswjnm7
1 day ago
[-]
"It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the enstablished [sic] conventions (respecting robots.txt, using proper UA string, rate limiting, whatever)."

How does "proper UA string" solve this "blowing up websites" problem

The only thing that matters with respect to the "blowing up websites" problem is rate-limiting, i.e., behaviour

"Shitty crawlers" are a nuisance because of their behaviour, i.e., request rate, not because of whatever UA string they send; the behaviour is what is "shitty" not the UA string. The two are not necessarily correlated and any heuristic that naively assumes so is inviting failure

"Spoofed" UA strings have been facilitated and expected since the earliest web browsers

For example,

https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...

To borrow the parent's phrasing, the "blowing up websites" problem has nothing to do with UA string specifically

It may have something to do with website operator reluctance to set up rate-limiting though; this despite widespread implementation of "web APIs" that use rate-limiting

NB. I'm not suggesting rate-limiting is a silver bullet. I'm suggesting that without rate-limiting, UA string as a means of addressing the "blowing up websites" problem is inviting failure
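
For illustration, a minimal nginx-style sketch of the kind of per-IP rate-limiting described here; the zone name, rate, and upstream address are placeholders, not anything from the article:

  # goes in the http block: shared zone keyed by client IP, 10 MB of state, 5 req/s steady rate
  limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;
  server {
      listen 80;
      location / {
          # allow short bursts; reject the excess with 429 instead of queueing it
          limit_req zone=perip burst=20 nodelay;
          limit_req_status 429;
          proxy_pass http://127.0.0.1:8080;
      }
  }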

reply
AbortedLaunch
1 day ago
[-]
Some of these crawlers appear to be designed to avoid rate limiting based on IP. I regularly see millions of unique IPs making strange requests, each making just one or at most a few per day. When a response contains a unique redirect, I often see a geographically distinct address fetching the destination.
reply
1vuio0pswjnm7
22 hours ago
[-]
"I regularly see millions of unique ips doing strange requests, each just one or at most a few per day."

How would UA string help

For example, a crawler making "strange" requests can send _any_ UA string, and a crawler doing "normal" requests can also send _any_ UA string.

The "doing requests" is what I refer to as "behaviour"

A website operator might think "Crawlers making strange requests send UA string X but not Y"

Let's assume the "strange" requests cause a "website load" problem^1

Then a crawler, or any www user, makes a "normal" request and sends UA string X; the operator blocks or redirects the request, unnecessarily

Then a crawler makes "strange" request and sends UA string Y; the operator allows the request and the website "blows up"

What matters for the "blowing up websites" problem^1 is behaviour, not UA string

1. The article's title calls it the "blowing up websites" problem, but the article text calls it a problem with "website load". As always the details are missing. For example, what is the "load" at issue. Is it TCP connections or HTTP requests. What number of simultaneous connections and/or requests per second is acceptable, and what number is not. Again, behaviour is the issue, not UA string

The acceptable numbers need to be published; for example, see documentation for "web APIs"

reply
AbortedLaunch
14 hours ago
[-]
I do not make any point on UA-strings, just on the difficulty of rate limiting.
reply
1vuio0pswjnm7
3 hours ago
[-]
"Some of these crawlers appear to be designed to avoid rate limiting based on IP."

Unless the rate is exceeded, the limit is not being avoided

"I regularly see millions of unique ips doing strange requests, each just one or at most a few per day."

Assuming the rate limit is more than one or a few requests every 24h this would be complying with the limit, not avoiding it

It could be that sometimes the problem website operators are concerned about is not "website load", i.e., the problem the article is discussing, it is actually something else (NB. I am not speculating about this particular operator, I am making a general observation)

If a website is able to fulfill all requests from unique IPs without affecting quality of service, then it stands to reason "website load" is not a problem the website operator is having

For example, the article's title claims Meta is amongst the "worst offenders" of creating excessive website load caused by "AI crawlers, fetchers"

Meta has been shown to have used third party proxy services with rotating IP addresses in order to scrape other websites; it also sued one of these services because it was being used to scrape Meta's website, Facebook

https://brightdata.com/blog/general/meta-dismisses-claim-aga...

Whether the problem that Meta was having with this "scraping" was "website load" is debatable; if the requests were being fulfilled without affecting QoS, then arguably "website load" was not a problem

Rate-limiting addresses the problem of website load; it allows website operators to ensure that requests from all IP addresses are adequately served as opposed to preferentially servicing some IP addresses to the detriment of others (degraded QoS)

Perhaps some website operators become concerned that many unique IP addresses may be under the control of a single entity, and that this entity may be a competitor; this could be a problem for them

But if their website is able to fulfill all the requests it receives without degrading QoS then arguably "website load" is not a problem they are having

NB. I am not suggesting that a high volume of requests from a single entity, each complying with a rate-limit, is acceptable, nor am I making any comment about the practice of "scraping" for commercial gain. I am only commenting on what rate-limiting is designed to do and whether it works for that purpose

reply
msgodel
15 hours ago
[-]
This isn't really about AI. This is a couple corporations being bad netizens and abusing infrastructure.

The same incentives to do this already existed for search engine operators.

reply
superkuh
1 day ago
[-]
This isn't AI damaging anything. This is corporations damaging things. Same as it ever was. No need for sci-fi non-human persons when legal corporate persons exist. They latch on to whatever big new thing in tech comes along that people don't understand, brand themselves with it, and cause damage trying to make money, even if they mostly fail at it. And since most actual humans only ever see or interact with the scammy corporate versions of $techthing, they come to believe $techthing = corporate behavior.

And as for denying service and preventing human people from visiting websites: Cloudflare does more of that damage in a single day than all these "AI"-associated corporations and their crappy crawlers have in years.

reply
autoexec
1 day ago
[-]
> This isn't AI damaging anything. This is corporations damaging things.

This is corporations damaging things because of AI. Corporations will damage things for other reasons too but the only reason they are breaking the internet in this way, at this time, is because of AI.

I think the "AI doesn't kill websites, corporations kill websites" argument is as flawed as the "Guns don't kill people, people kill people" argument.

reply
superkuh
1 day ago
[-]
Correct. It's a good, legitimate argument in both contexts. I use both local AI and local firearms as a human person and I am not doing, and have not done, damage to anyone. The tools aren't the problem.

The problem in this case is the near complete protection from legal liability that corporate structures give to the people behaving badly. Like how Coca Cola can get away with killing people (https://prospect.org/features/coca-cola-killings/) but a person can't, if you want to keep the firearms analogy going. But it's a bad analogy because the firearms as tool actually at least are involved in the bad (and good) actions. AI itself isn't even involved in the HTTP requests and probably isn't even running on the same premises.

reply
ujkhsjkdhf234
1 day ago
[-]
Cloudflare exists because people can't be good stewards of the internet.

> This isn't AI damaging anything. This is corporations damaging things

This is the guns don't kill people, people kill people argument. The problem with crawlers is about 10x worse than it was previously because of AI and their hunger for data.

reply
renewiltord
1 day ago
[-]
If you don't want to receive data, don't. If you don't want to send data, don't. No one is asking you to receive traffic from my IPs or send to my IPs. You've just configured your server one way.

Or to use a common HN aphorism “your business model is not my problem”. Disconnect from me if you don’t want my traffic.

reply
PaulDavisThe1st
1 day ago
[-]
I don't know if I want your traffic until I see what your traffic is.

You want to look at one of our git commits? Sure! That's what our web-fronted git repo is for. Go right ahead! Be our guest!

Oh ... I see. You want to download every commit in our repository. One by one, when you could have used git clone. Hmm, yeah, I don't want your traffic.

But wait, "your traffic" seems to originate from ... consults fail2ban logs ... more than 900k different IP addresses, so "disconnecting" from you is non-trivial.

I can't put it more politely than this: fuck off. Do not pass go. Do not collect stock options. Go to hell, and stay there.

reply
renewiltord
1 day ago
[-]
There's a protocol for that. Just reject the connection. Don't implode, just write some code. Your business model isn't my problem.
reply
ben_w
1 day ago
[-]
Reject 900k different connections from different origins, each asking for something that would in isolation be fine, when the only problem is the quantity?
reply
Nextgrid
21 hours ago
[-]
But what's the difference between one user making 900k hits and 900k different users making one hit? In both cases you have made a resource available and people are requesting it, some more than others.

If serving traffic for free is a problem, don't. If you are only able to serve N requests per second/minute/day/etc, do that. But don't complain if you give out something for free and people take it.

(also, a lot of the numbers people quote during these AI scraper "attacks" are very tame and the fact they are branded as problematic makes me suspect there's substantial incompetence in the solutions deployed to serve them)

reply
PaulDavisThe1st
4 hours ago
[-]
There were never 900k users interested in each commit. Never was, never will be. So that's a false comparison.

These scrapers have upped both the server load (requests per second) and bandwidth requirements, without me consenting to it. If they were actual human users OR bots that were appropriately designed to minimize their impact on the target sites, that would be perfectly OK.

Maybe if this were truly the only way to get our god-like LLMs to work in a god-like way (*), it would also be acceptable. But it isn't.

And on top of that, they are incompetently designed and they are causing real issues that a huge number of sites need to address.

(*) put differently, if all this current scraping activity delivered some notable benefit to humanity

reply
latexr
14 hours ago
[-]
> But what's the difference between one user making 900k hits and 900k different users making one hit?

What’s the difference between giving 900K meals to one person and feeding 900K people? The former is being abusive, wasteful, and depriving almost 900K other people of food. They are also being deceitful by pretending to be 900K different people.

Resources are finite. Web requests aren’t food, but you still pay for them. A spike in traffic may mean your service being down for the rest of the month, which is more acceptable if you helped a bunch of people who have now learned about and can talk about and share what you provided, versus having wasted all your traffic on a single bad actor who didn’t even care because they were just a robot.

> makes me suspect there's substantial incompetence in the solutions deployed to serve them

So you see bots scraping the Wikipedia webpages instead of downloading their organised dump, or scraping every git service webpage instead of cloning a repo, and think the incompetence is with the website instead of the scraper wasting time and resources to do a worse job?

reply
PaulDavisThe1st
1 day ago
[-]
Reject the connection based on what?

IP address (presumably after too many visits) ? So now the iptables mechanism has to scale to fit your business model (of hammering my git repository 1 commit at a time from nearly a million IP addresses) ? Why does the code I use have to fit your braindead model? We wouldn't care if you just used git clone, but you're too dumb to do that.

The URL? Legitimate human (or other) users won't be happy about that.

Our web-fronted git repo is not part of our business model. It's just a free service we like to offer people, unrelated to revenue flow or business operations. So your behavior is not screwing my business model, but it is screwing up people who for whatever reason want to use that service, who can no longer use the web-fronted git repo.

ps. I've used "you" throughout the above because you used "my". No idea if you personally are involved in any such behavior.

reply
whatevaa
15 hours ago
[-]
That's exactly what they are doing: rejecting the connections of people like you, because you don't care. And if you start your own business, you will suddenly encounter the same problem too. Then you will be able to "just write some code".

Any time somebody writes "just", you can immediately tell that they have no idea what they are talking about.

reply
msgodel
15 hours ago
[-]
Sure. What that looks like is always using ssh to access git and things like github going away. I think most of us can agree that's probably not good. For the tools non-technical people use it's probably far worse, pretty much the end of the open web outside static personal pages.

I think the ISPs serving these requests are probably going to have to start going after customers for being abusive in order for this to stop.

reply
renewiltord
13 hours ago
[-]
Seems fine to me. Same as ads. If you don't want to send content with ads that I will render without the ads, don't send it. That ended some businesses and made others paywall.

Such is life.

reply
latexr
1 day ago
[-]
> Disconnect from me if you don’t want my traffic.

The problem is precisely that that is not possible. It is very well known that these scrapers aren’t respecting the wishes of website owners and even circumvent blocks any way they can. If these companies respected the website owners’ desires for them to disconnect, we wouldn’t be having this conversation.

reply
renewiltord
1 day ago
[-]
Websites aren't people. They don't have desires. Machines have communication protocols. You can set your machine to blackhole the traffic or TCP RST or whatever you want. It's just network traffic. Do what you want with it.

People send me spam. I don't whine about it. I block it.

reply
latexr
1 day ago
[-]
> Websites aren't people. They don't have desires.

Obviously I’m talking about the people behind them, and I very much doubt you lack the minimal mental acuity to understand that when I used “website owners” in the preceding sentence. If you don’t want to engage in a good faith discussion you can just say so, no need to waste our time with fake pedantry. But alright, I edited that section.

> You can set your machine to blackhole the traffic or TCP RST or whatever you want. It's just network traffic.

And then you spend all your time in a game of cat and mouse, while these scrapers bring your website down and cost you huge amounts of money. Are you incapable of understanding how that is a problem?

> People send me spam. I don't whine about it. I block it.

Is the amount of spam you get so overwhelming that it swamps your inbox every day to a level you’re unable to find the real messages? Do those spammers routinely circumvent your rules and filters after you’ve blocked them? Is every spam message you get costing you money? Are they increasing every day? No? Then it’s not the same thing at all.

reply
whatevaa
15 hours ago
[-]
I would suggest not arguing with a wall; the person you are replying to thinks there exists some magic sauce of code to solve this problem.
reply
pjc50
23 hours ago
[-]
People are doing exactly that. And then other people who want to use the website are asking why they get blocked by false positives.
reply
renewiltord
2 hours ago
[-]
Yeah, it just seems like things are playing out as one would expect them to. You're right
reply
IT4MD
1 day ago
[-]
>AI is going to damage society not in fancy sci-fi ways but by centralizing profit made at the expense of everyone else on the internet

10/10. No notes.

reply
mcpar-land
1 day ago
[-]
My worst offender for scraping one of my sites was Anthropic. I deployed an AI tar pit (https://news.ycombinator.com/item?id=42725147) to see what it would do with it, and Anthropic's crawler kept scraping it for weeks. Going by the logs, I think I wasted nearly a year of their time in total, because they were crawling in parallel. Other scrapers weren't so persistent.
reply
fleebee
1 day ago
[-]
For me it was OpenAI. GPTBot hammered my honeypot with 0.87 requests per second for about 5 weeks. Other crawlers only made up 2% of the traffic. 1.8 million requests, 4 GiB of traffic. Then it just abruptly stopped for whatever reason.
reply
whatevaa
15 hours ago
[-]
Tar-pit them and serve fake but legitimate-looking content. Poison it.
reply
Group_B
1 day ago
[-]
That's hilarious. I need to set up one of these myself
reply
bwb
1 day ago
[-]
My book discovery website shepherd.com is getting hammered every day by AI crawlers (and crashing often)... my security lists in CloudFlare are ridiculous and the bots are getting smarter.

I wish there were a better way to solve this.

reply
weaksauce
1 day ago
[-]
Put a honeypot link in your site that only robots will hit because it's hidden. Make sure it's not in robots.txt, or disallow it there if you can. Set up a rule so that any IP that hits that link gets a 1-day ban in your fail2ban or the like.
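
A rough sketch of that fail2ban setup, assuming nginx access logs; the trap path, filter name, and log path are invented for illustration:

  # /etc/fail2ban/filter.d/honeypot.conf -- match any hit on the hidden trap URL
  [Definition]
  failregex = ^<HOST> .*"(GET|POST) /secret-trap-link

  # /etc/fail2ban/jail.local -- a single hit earns a one-day ban
  [honeypot]
  enabled  = true
  filter   = honeypot
  logpath  = /var/log/nginx/access.log
  maxretry = 1
  bantime  = 86400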
reply
bwb
1 day ago
[-]
Got a good link to something on GitHub that does this?

I have to make sure legit bots don't get hit, as a huge percentage of our traffic, which helps the project stay active, is from Google, etc.

reply
weaksauce
1 hour ago
[-]
Here's one example using rack-attack on a Rails app:

https://github.com/pinballmap/pbm/blob/302ac638850711878ac61...

https://github.com/pinballmap/pbm/blob/302ac638850711878ac61...

It only bans for 3 hours, though. If they don't respect the hidden link and robots.txt, they get banned.

reply
subscribed
12 hours ago
[-]
I did it manually and got fail2ban to read the access log anyway.

Then it's a permanent iptables rule, but it could be a CF API call as well.

reply
skydhash
1 day ago
[-]
If you're not updating the publicly accessible part of the database often, see if you can put some caching strategy in place and let Cloudflare take the hit.
reply
bwb
1 day ago
[-]
Yep, all but one page type is heavily cached at multiple levels. We are working to get the rest and improve it further... just annoying as they don't even respect limits..
reply
shepherdjerred
18 hours ago
[-]
ah, you're the one who stopped me from being jerred@shepherd.com!
reply
bwb
10 hours ago
[-]
hah eh?
reply
p3rls
23 hours ago
[-]
At this point I'd take a thermostat that can read when my dashboard starts getting heated (always the same culprits causing these same server spikes) and flicks attack mode on for cloudflare.... it's so ridiculous trying to run anything that's not a wordpress these days
reply
rco8786
1 day ago
[-]
OpenAI straight up DoSed a site I manage for my in-laws a few months ago.
reply
muzani
1 day ago
[-]
What is it about? I'm curious what kinds of things people ask that floods sites.
reply
rco8786
1 day ago
[-]
The site is about a particular type of pipeline cleaning (think water/oil pipelines). I am certain that nobody was asking about this particular site, or even the industry it's in, 15,000 times a minute, 24 hours a day.

It's much more likely that their crawler is just garbage and got stuck into some kind of loop requesting my domain.

reply
dwd
18 hours ago
[-]
It's common to see them get stuck in a loop on online stores trying every combination of product filter over and over.

Even Googlebot has to be told to not crawl particular querystrings, but the AI crawlers are worse.

reply
rco8786
8 hours ago
[-]
Not an online store just a bog standard Wordpress site
reply
average_r_user
1 day ago
[-]
I suppose that they just keep referring to the website in their chats, and probably they have selected the search function, so before every reply, the crawler hits the website
reply
tehwebguy
1 day ago
[-]
This is a feature! If half the internet is nuked and the other half put up fences there is less readily available training data for competitors.
reply
AutoDunkGPT
1 day ago
[-]
I love this for us!
reply
timsh
1 day ago
[-]
A bit off-topic but wtf is this preview image of a spider in the eye? It’s even worse than the clickbait title of this post. I think this should be considered bad practice.
reply
encrypted_bird
19 hours ago
[-]
I fully agree, and speaking as someone with macroinsectophobia (fear of large or many insect (or insect-like) creatures), seeing it really makes me uncomfortable. It isn't enough to send me into panic mode or anything, but damn if it doesn't freak me out.
reply
shinycode
1 day ago
[-]
At the same time, it's so practical to ask a question and have it open 25 pages to search and summarize the answer. Before, that's more or less what I was trying to do by hand. Maybe not 25 websites, because thanks to crap SEO the top 10 contains BS content, so I curated the list, but the idea is the same, no?
reply
rco8786
1 day ago
[-]
My personal experience is that OpenAI's crawler was hitting a very, very low-traffic website I manage tens of thousands of times a minute, non-stop. I had to block it in Cloudflare.
reply
Leynos
1 day ago
[-]
Where is caching breaking so badly that this is happening? Are OpenAI failing to use etags or honour cache validity?
reply
Analemma_
1 day ago
[-]
Their crawler is vibe-coded.
reply
danaris
1 day ago
[-]
Same here.

I run a very small browser game (~120 weekly users currently), and until I put its Wiki (utterly uninteresting to anyone who doesn't already play the game) behind a login-wall, the bots were causing massive amounts of spurious traffic. Due to some of the Wiki's data coming live from the game through external data feeds, the deluge of bots actually managed to crash the game several times, necessitating a restart of the MariaDB process.

reply
mrweasel
1 day ago
[-]
Wikis seem to attract AI bots like crazy, especially the bad kind that will attempt any type of cache invalidation available to them.
reply
pm215
1 day ago
[-]
Sure, but if the fetcher is generating "39,000 requests per minute" then surely something has gone wrong somewhere ?
reply
miohtama
1 day ago
[-]
Even if it is generating 39k req/minute, I would expect most of the pages to already be cached locally by Meta, or served statically by their respective hosts. We have been working hard on caching websites and it has been a solved problem for the last decade or so.
reply
ndriscoll
1 day ago
[-]
Could be serving no-cache headers? Seems like yet another problem stemming from every website being designed as if it were some dynamic application when nearly all of them are static documents. nginx doing 39k req/min to cacheable pages on an n100 is what you might call "98% idle", not "unsustainable load on web servers".

The data transfer, on the other hand, could be substantial and costly. Is it known whether these crawlers do respect caching at all? Provide If-Modified-Since/If-None-Match or anything like that?
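
For reference, a cache-respecting fetch looks roughly like this on the wire (hostname and validator values are made up); a crawler that sends the validators back gets a body-less 304 when the page hasn't changed:

  GET /docs/page.html HTTP/1.1
  Host: example.org
  If-None-Match: "66f1a2b-1a3c"
  If-Modified-Since: Tue, 01 Apr 2025 10:00:00 GMT

  HTTP/1.1 304 Not Modified
  ETag: "66f1a2b-1a3c"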

reply
mrweasel
1 day ago
[-]
Many AI crawlers seem to go to great lengths to avoid caches; I'm not sure why.
reply
andai
1 day ago
[-]
They're not very good at web queries; if you expand the thinking box to see what they're searching for, about half of it is nonsense.

e.g. they'll take an entire sentence the user said and put it in quotes for no reason.

Thankfully search engines started ignoring quotes years ago, so it balances out...

reply
internet_points
1 day ago
[-]
They mention Anubis, Cloudflare, and robots.txt – does anyone have experience with how much any of them help?
reply
davidfischer
1 day ago
[-]
My employer, Read the Docs, has a blog post on the subject (https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...) of how we got pounded by these bots to the tune of thousands of dollars. To be fair though, the AI company that hit us the hardest did end up compensating us for our bandwidth bill.

We've done a few things since then:

- We already had very generous rate limiting rules by IP (~4 hits/second sustained) but some of the crawlers used thousands of IPs. Cloudflare has a list that they update of AI crawler bots (https://developers.cloudflare.com/bots/additional-configurat...). We're using this list to block these bots and any new bots that get added to the list.

- We have more aggressive rate limiting rules by ASN on common hosting providers (eg. AWS, GCP, Azure) which also hits a lot of these bots.

- We are considering using the AI crawler list to rate limit by user agent in addition to rate limiting by IP. This will allow well behaved AI crawlers while blocking the badly behaved ones. We aren't against the crawlers generally.

- We now have alert rules that alert us when we get a certain amount of traffic (~50k uncached reqs/min sustained). This is basically always some new bot cranked to the max and usually an AI crawler. We get this ~monthly or so and we just ban them.

Auto-scaling made our infra resilient enough that we don't even notice big traffic spikes. However, the downside of that is that the AI crawlers were hammering us without causing anything noticeable. Being smart with rate limiting helps a lot.

reply
nromiun
1 day ago
[-]
CDNs like Cloudflare are the best. Anubis is a rate limiter for small websites where you can't or won't use CDNs like Cloudflare. I have used Cloudflare on several medium sized websites and it works really well.

Anubis's creator says the same thing:

> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.

Source: https://github.com/TecharoHQ/anubis

reply
Nextgrid
21 hours ago
[-]
Alternatively, go-away is also an option: https://git.gammaspectra.live/git/go-away, especially if you don't want cringy branding.
reply
hombre_fatal
1 day ago
[-]
CloudFlare's Super Bot Fight Mode completely killed the surge in bot traffic for my large forum.
reply
ajsnigrutin
1 day ago
[-]
And added captchas to every user with an adblock or sensible privacy settings.
reply
pjc50
1 day ago
[-]
How would you suggest that such users prove they're not a crawler?
reply
ajsnigrutin
1 day ago
[-]
Why would they have to?

What's wrong with crawlers? That's how Google finds you, and people find you on Google.

Just put some sensible request limits per hour per IP, and be done with it.

reply
nucleardog
1 day ago
[-]
> Just put some sensible request limits per hour per ip, and be done.

I have no personal experience, but probably worth reading like... any of the comments where people are complaining about these crawlers.

Claims are that they're: ignoring robots.txt; sending fake User-Agent headers; they're crawling from multiple IPs; when blocked they will use residential proxies.

People who have deployed Anubis to try and address this include: Linux Kernel Mailing List, FreeBSD, Arch Linux, NixOS, Proxmox, Gnome, Wine, FFMPEG, FreeDesktop, Gitea, Marginalia, FreeCAD, ReactOS, Duke University, The United Nations (UNESCO)...

I'm relatively certain if this were as simple as "just set a sensible rate limit and the crawlers will stop DDOS'ing your site" one person at one of these organizations would have figured that out by now. I don't think they're all doing it because they really love anime catgirls.

reply
skydhash
1 day ago
[-]
Or use CDN caching. That's one of the things they're there for.
reply
bakugo
1 day ago
[-]
robots.txt is obviously only effective against well-behaved bots. OpenAI etc are usually well behaved, but there's at least one large network of rogue scraping bots that ignores robots.txt, fakes the user-agent (usually to some old Chrome version) and cycles through millions of different residential proxy IPs. On my own sites, this network is by far the worst offender and the "well-behaved" bots like OpenAI are barely noticeable.

To stop malicious bots like this, Cloudflare is a great solution if you don't mind using it (you can enable a basic browser check for all users and all pages, or write custom rules to only serve a check to certain users or on certain pages). If you're not a fan of Cloudflare, Anubis works well enough for now if you don't mind the branding.

Here's the cloudflare rule I currently use (vast majority of bot traffic originates from these countries):

  ip.src.continent in {"AF" "SA"} or
  ip.src.country in {"CN" "HK" "SG"} or
  ip.src.country in {"AE" "AO" "AR" "AZ" "BD" "BR" "CL" "CO" "DZ" "EC" "EG" "ET" "ID" "IL" "IN" "IQ" "JM" "JO" "KE" "KZ" "LB" "MA" "MX" "NP" "OM" "PE" "PK" "PS" "PY" "SA" "TN" "TR" "TT" "UA" "UY" "UZ" "VE" "VN" "ZA"} or
  ip.src.asnum in {28573 45899 55836}
reply
GuB-42
1 day ago
[-]
Surely there are solutions more subtle than blocking 80% of the world population...
reply
sumtechguy
1 day ago
[-]
is there an http code for 'hey I gave you this already 10 times. This is a you problem not a me problem I refuse to give you another copy'.

It also sounds like there is an opportunity to sell scraped data to these companies. Instead of 10 crawlers we get one crawler and they just resell/give it away. More honeypots doesn't really fix the root cause (which is greed).

reply
GuB-42
1 day ago
[-]
> is there an http code for 'hey I gave you this already 10 times.

429 Too Many Requests

> This is a you problem not a me problem

That's the "4" in "429"

reply
bakugo
1 day ago
[-]
I should've made it clear that it's not a block rule, just a challenge rule. Those people can still access the website, they just have to go through the "checking your browser" page that you're probably familiar with.

As I said, you can just enable that for everyone and be done with it, but with a custom rule, you can avoid showing it to people that are unlikely to be bots.

reply
BrenBarn
14 hours ago
[-]
One thing I don't fully understand in all this is how the IP address stuff works. Like, I keep hearing people saying somebody can get 10 gazillion residential IPs so they become unblockable, but how? This article also mentions crawlers should publish their IP ranges. Like, yeah? What if using more than X number of IPs to crawl was a criminal offense unless you got a permit, which would require you to identify and publish all those IPs up front?
reply
defrost
14 hours ago
[-]
> But How?

Typically by hiring them... although their provenance might be sketchy.

eg: https://www.cybersecuritydive.com/news/us-charges-oregon-man...

^ Oregon man arrested for running a ~70,000-device DDoS-for-hire botnet; the distributed attacking computers were mostly compromised IoT gadgets - fridges, routers, toasters, doorbells, etc. with weak security that were probably scanned and p0wned via Shodan (or a similar device-mapping project).

More legally, there are many "free software" deals that offer services to people via installed software that comes with a side order of background web crawling in the fine print of the WALL-O'-TEXT Terms of Use agreement.

Enterprising middle people gather up bots and offer them for hire to web crawlers; large-scale companies will farm their own bots via their existing user base.

reply
BrenBarn
12 hours ago
[-]
Okay, but are OpenAI and Meta straight up buying botnets on the black market?
reply
diggan
8 hours ago
[-]
OpenAI I'm not so sure about, but since Meta already got caught downloading copyrighted material to train LLMs, I think it isn't far-fetched for them to also use borderline illegal methods for acquiring IPs to use.
reply
defrost
12 hours ago
[-]
Unlikely.

There are many ways; at their scale they (Meta at least) probably have edge servers in ISPs across the planet and can easily mix their crawlers with residential IP addresses rotationally assigned by the domestic ISPs they co-mingle with.

Or some other way, legal but somewhat obfuscated.

reply
xrd
1 day ago
[-]
Isn't there a class action lawsuit coming from all this? I see a bunch of people here indicating these scrapers are costing real money to people who host even small niche sites.

Is the reason these large companies don't care because they are large enough to hide behind a bunch of lawyers?

reply
EgregiousCube
1 day ago
[-]
Under what law? It's interesting because these are sites that host content for the purpose of providing it to anonymous network users. eBay won a case against a scraper back in 2000 by claiming that the server load was harming them, but that reasoning was later overturned because it's difficult to say that server load is actual harm. eBay was in the same condition before and after a scrape.

Maybe some civil lawsuit about terms of service? You'd have to prove that the scraper agreed to the terms of service. Perhaps in the future all CAPTCHAs come with a TOS click-through agreement? Or perhaps every free site will have a login wall?

reply
buttercraft
1 day ago
[-]
If you put measures in place to prevent someone from accessing a computer, and they circumvent those measures, is that not a criminal offense in some jurisdictions?
reply
integralid
1 day ago
[-]
On the other hand, DDoS attacks are pretty clearly on the illegal side. I wonder how this would play out in practice.
reply
Nextgrid
22 hours ago
[-]
Intention plays a part. (D)DoS is intentionally done to make a website unavailable to legitimate users. Scraping may do this as a side-effect (if you are incompetent and/or use the "cloud"), but isn't the intention.
reply
outside1234
1 day ago
[-]
Yes. There are one set of rules for us and another set of rules for anything with more than a billion dollars.
reply
neilv
1 day ago
[-]
Don't the companies in the headlines pay big bucks for people working on "AI"?

Maybe they are paying big bucks for people who are actually very bad at their jobs?

Why would the CEOs tolerate that? Do they think it's a profitable/strategic thing to get away with, rather than a sign of incompetence?

When subtrees of the org chart don't care that they are very bad at their jobs, harmed parties might have to sue to get the company to stop.

reply
levleontiev
1 day ago
[-]
That's why I am building a Rate Limiter as a service. Seems that it has its niche.
reply
s_ting765
1 day ago
[-]
Can confirm, Meta's bots have been aggressively scraping some of my internet-facing services, but they do respect robots.txt.
reply
exasperaited
1 day ago
[-]
Xe Iaso is my spirit animal.

> "I don't know what this actually gives people, but our industry takes great pride in doing this"

> "unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance that can produce output that superficially resembles the output of human employees"

> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."

<3 <3

reply
sct202
1 day ago
[-]
I wonder how much of the rapid expansion of datacenters is from trying to support bot traffic.
reply
loeg
1 day ago
[-]
In terms of CapEx, not much. The GPUs are much more expensive. Physical footprint? I don't know.
reply
jasoncartwright
1 day ago
[-]
I recently, for pretty much the first time ever in 30 years of running websites, had to blanket ban crawlers. I now whitelist a few, but the rest (and all other non-UK visitors) have to pass a Cloudflare challenge [1].

AI crawlers were downloading whole pages and executing all the javascript tens of millions of times a day - hurting performance, filling logs, skewing analytics and costing too much money in Google Maps loads.

Really disappointing.

[1] https://developers.cloudflare.com/cloudflare-challenges/

reply
breakyerself
1 day ago
[-]
There's so much bullshit on the internet; how do they make sure they're not training on nonsense?
reply
bgwalter
1 day ago
[-]
Much of it is not training. The LLMs fetch webpages for answering current questions, summarize or translate a page at the user's request etc.

Any bot that answers daily political questions like Grok has many web accesses per prompt.

reply
snowwrestler
1 day ago
[-]
While it’s true that chatbots fetch information from websites in response to requests, the load from those requests is tiny compared to the volume of requests indexing content to build training corpuses.

The reason is that user requests are similar to other web traffic because they reflect user interest. So those requests will mostly hit content that is already popular, and therefore well-cached.

Corpus-building crawlers do not reflect current user interest and try to hit every URL available. As a result these hit URLs that are mostly uncached. That is a much heavier load.

reply
shikon7
1 day ago
[-]
But surely there aren't thousands of new corpuses built every minute.
reply
bgwalter
1 day ago
[-]
Why would the Register point out Meta and OpenAI as the worst offenders? I'm sure they do not continuously build new corpuses every day. It is probably the search function, as mentioned in the top comments.
reply
snowwrestler
7 hours ago
[-]
It says in the first sentence of the article that it is 80% bots (crawlers) and only 20% fetchers.

Of course they are crawling every day to improve their training data. The goal is LLMs that know everything, but “everything” changes on a daily basis.

Meta and OpenAI are simply the largest after Google, but Google has had ~20 more years to learn how to politely operate crawlers at full-Internet scale.

reply
8organicbits
1 day ago
[-]
Is an AI chatbot fetching a web page to answer a prompt a 'web scraping bot'? If there is a user actively prompting the LLM, isn't it more of a user agent? My mental model, even before LLMs, was that a human being present changes a bot into a user agent. I'm curious if others agree.
reply
bgwalter
1 day ago
[-]
The Register calls them "fetchers". They still reproduce the content of the original website without the website gaining anything but additional high load.

I'm not sure how many websites are searched and discarded per query. Since it's the remote, proprietary LLM that initiates the search I would hesitate to call them agents. Maybe "fetcher" is the best term.

reply
ronsor
1 day ago
[-]
> The Register calls them "fetchers". They still reproduce the content of the original website without the website gaining anything but additional high load.

So does my browser when I have uBlock Origin enabled.

reply
danaris
1 day ago
[-]
But they're (generally speaking) not being asked for the contents of one specific webpage, fetching that, and summarizing it for the user.

They're going out and scraping everything, so that when they're asked a question, they can pull a plausible answer from their dataset and summarize the page they found it on.

Even the ones that actively go out and search/scrape in response to queries aren't just scraping a single site. At best, they're scraping some subset of the entire internet that they have tagged as being somehow related to the query. So even if what they present to the user is a summary of a single webpage, that is rarely going to be the product of a single request to that single webpage. That request is going to be just one of many, most of which are entirely fruitless for that specific query: purely extra load for their servers, with no gain whatsoever.

reply
prasadjoglekar
1 day ago
[-]
By paying a pretty penny for non-bullshit data (Scale AI). That and Nvidia are the shovels in this gold rush.
reply
danny_codes
1 day ago
[-]
Making a lot of assumptions about the quality of Scale AI.
reply
danaris
1 day ago
[-]
I mean...they don't. That's part of the problem with "AI answers" and such.
reply
p3rls
23 hours ago
[-]
I have "under attack" mode on right now because I am being attacked by hundreds of Chinese proxied bots, taking my PHP response times from their usual 0.2 seconds to 2+ seconds. Fucking ridiculous.
reply
vkou
1 day ago
[-]
Why is this not a violation of the CFAA, and why aren't SWEs and directors going to prison over it?

As long as I have an EULA or a robots.txt or even a banner that forbids this sort of access, shouldn't any computerized access be considered abuse? Something, something, scraping JSTOR?

reply
okasaki
1 day ago
[-]
I wonder if we're doing the wrong thing blocking them with invasive tools like cloudflare?

If all you're concerned about is server load, wouldn't it be better to just offer a tar file containing all of your pages that they can download instead? The models are months out of date, so a monthly dump would surely satisfy them. There could even be some coordination for this.

They're going to crawl anyway. We can either cooperate or turn it into some weird dark market with bad externalities like drugs.
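
As a minimal sketch of the dump idea above (paths and filenames are hypothetical), a monthly cron job could publish a tarball of the rendered pages for bulk download:

  # crontab entry: at 03:00 on the 1st of each month, rebuild the public site dump
  0 3 1 * *  tar czf /var/www/site/dumps/site-$(date +\%Y-\%m).tar.gz -C /var/www/site html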

reply
masfuerte
1 day ago
[-]
A tar file would be better if the crawlers would use it, but even sites with well-publicised options for bulk downloads (like wikipedia) are getting hammered by the bots.

The bot operators DNGAF.

reply
HankStallone
6 hours ago
[-]
Right. I don't care if AI (or anything else) indexes or learns from my sites. That's what they're there for. But yesterday I blocked an IP that hit one of my sites 82000 times in an hour, or 22/second. And apparently it's a very stupid bot, because it kept redownloading CSS and other asset files every time it saw a link to them.

There's no way the people behind that bot are going to follow any suggestions to make it behave better. After all, adding things like caching and rate-limiting to your web crawler might take a few hours, and who's got time for that.

reply
recallingmemory
1 day ago
[-]
Yeah, I am in the opposing camp too - I don't use Cloudflare's bot fight tooling on any of our high traffic websites. I'm not seeing the issue with allowing bots to crawl our websites other than some additional spend for bandwidth. Agent mode is pretty powerful when paired with a website that cooperates, and if people want to use AI to interact with our data then what's wrong with that?
reply
zzo38computer
1 day ago
[-]
I also do not like Cloudflare.

If the crawlers were aware of these archive files and were willing to use them, that would help, but they aren't. (It would also help to know which dynamic files are worthless for archiving and mirroring, but they will often ignore that.)

reply
lostmsu
1 day ago
[-]
This article and the "report" look like a submarine ad for Fastly services. At no point does it mention the human/bot/AI bot ratio, making it useless for any real insights.
reply
delfinom
1 day ago
[-]
I run a symbol server, as in, PDB debug symbol server. Amazon's crawler and a few others love requesting the ever loving shit out of it for no obvious reason. Especially since the files are binaries.

I just set a rate-limit in cloudflare because no legitimate symbol server user will ever be excessive.

reply
ack_complete
1 day ago
[-]
I have a simple website consisting solely of static webpages pointing to a bunch of .zip binaries. Nothing dynamic, all highly cacheable. The bots are re-downloading the binaries over and over. I can see Bingbot downloading a .zip file in the logs, and then an hour later another Bingbot instance from a different IP in the same IP range downloading the same .zip file in full. These are files that were uploaded years ago and have never retroactively changed, and don't contain crawlable contents within them (executable code).

Web crawlers have been around for years, but many of the current ones are more indiscriminate and less well behaved.

reply
jgalt212
1 day ago
[-]
About 18 months ago, our non-Google/Bing bot traffic went from single-digit percentages to over 99.9% of traffic. We tried some home-spun solutions at first, but eventually threw in the towel and put Cloudflare in front of all our publicly accessible pages. On a long term basis, this was probably the right move for us, but we felt forced into it. And the Cloudflare Managed Ruleset definitely blocks some legit traffic, such that it requires a fair amount of manual tuning.
reply
greatgib
1 day ago
[-]
That's the moment you remember that years ago, self-hosted, you could sustain millions of requests per second on a single low-end server for next to nothing.

But now you are "on the cloud", with lambdas, because "who cares", and hiring a proper part-time sysadmin is too complicated, so now you are pounded with crazy costs for moderate loads...

reply
hereme888
1 day ago
[-]
I'm absolutely pro AI-crawlers. The internet is so polluted with garbage, compliments of marketing. My AI agent should find and give me concise and precise answers.
reply
mrweasel
1 day ago
[-]
They just don't need to hammer sites into the ground to do it. This wouldn't be an issue if the AI companies were a bit more respectful of their data sources, but they are not; they don't care.

All this attempting to block AI scrapers would not be necessary if they respected rate limits, knew how to back off when a server starts responding too slowly, or cached frequently visited sites. Instead some of these companies will do everything, including using residential ISPs, to ensure that they can just piledrive the website of some poor dude who's just really into lawnmowers, or the git repo of some open source developer who just wants to share their work.

Very few people would actually be against AI crawlers if they showed just the tiniest amount of respect, but they don't. I think Drew DeVault said it best: "Please stop externalizing your costs directly into my face"

reply
lionkor
1 day ago
[-]
The second I get hit with bot traffic that makes my server heat up, I would just slam some aggressive anti-bot stuff in front. Then you, my friend, are getting nothing with your fancy AI agent.
reply
mediumsmart
1 day ago
[-]
So the fancy AI agent will have to get really fancy and mimic human traffic, and all is good until the server heats up from all those separate human trafficionados - then what?
reply
depingus
1 day ago
[-]
The end of the open web. That's what.

Sites will have to either shut down or move behind a protection racket run by one of the evil megacorps. And TBH, shutting down is the better option.

With click-through traffic dead, what's even the point of putting anything online? To feed AIs so that someone else can profit at my (very literal) expense? No thanks. The knowledge dies with me.

The internet dark age is here. Everyone, retreat to your fiefdom.

reply
lionkor
1 day ago
[-]
Nobody is forcing anyone to share their knowledge. What then? Dead internet.
reply
depingus
1 day ago
[-]
Absolutely yes. I guarantee you these megacorps are betting on a future where the open internet has been completely obliterated, and the only way to participate online is through their portal, where everything you do feeds back into their AI. Because that is the only way to acquire fresh food for their beast.
reply
hereme888
1 day ago
[-]
I've never run any public-facing servers, so maybe I'm missing the experience behind your frustration. But mine, as a "consumer", is wanting clean answers, like what you'd expect when asking your own employee for information.
reply