AI is going to damage society not in fancy sci-fi ways but by centralizing profit made at the expense of everyone else on the internet, who are then forced to erect boundaries to protect themselves, worsening the experience for the rest of the public. That public also gets to pay higher electricity bills, because keeping humans warm is not as profitable as a machine that directly converts electricity into stock price rises.
[0]: https://drewdevault.com/2021/04/26/Cryptocurrency-is-a-disas...
[1]: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
No, because there is no such thing, at least not as understood by Garrett Hardin, who put forward the phrase.
Commons fail when selfish, greedy people subvert or destroy the governance structures that help control them. If those governance structures exist (and they do for all historical commons) and continue to exist, the commons suffers no tragedy.
This recent slide deck covers Ostrom's ideas on the subject; even Hardin eventually conceded that she was correct, and that his diagnosis of a "tragedy of the commons" does not actually describe the historical processes by which commons are abused.
https://dougwebb.site/slides/commons
That said ... arguably there is a problem here with a "commons" that does in fact lack any real governance structure.
> The standard, developed in 1994, relies on voluntary compliance [0]
It was conceived in a world with an expectation of collectively respectful behaviour: specifically that search crawlers could swamp "average Joe's" site but shouldn't.
We're in a different world now but companies still have a choice. Some do still respect it... and then there's Meta, OpenAI and such. Communities only work when people are willing to respect community rules, not have compliance imposed on them.
It then becomes an arms race: a reasonable response from average Joe is "well, OK, I'll allow anyone but [Meta|OpenAI|...] to access my site". Fine in theory, difficult in practice:
1. Block IP addresses for the offending bots --> bots run from obfuscated addresses
2. Block the bot user agent --> bots lie about UA.
...and so on.
If it was crap coding, then the bots wouldn't have so many mechanisms to circumvent blocks. Once you block the OpenAI IP ranges, they start using residential proxies. Once you block their UA strings, they start impersonating other crawlers or browsers.
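For illustration only, here is a minimal sketch of why UA-only blocking is a speed bump at best (Flask assumed; the blocklist tokens are just examples): the check is a one-liner, and so is evading it by sending a different string.

    # Minimal sketch of UA-based blocking (illustrative names; Flask assumed).
    # Anything the client controls, including User-Agent, can be spoofed, so
    # this only stops crawlers that choose to identify themselves honestly.
    from flask import Flask, request, abort

    app = Flask(__name__)
    BLOCKED_UAS = ("GPTBot", "meta-externalagent", "CCBot")  # example tokens

    @app.before_request
    def block_declared_bots():
        ua = request.headers.get("User-Agent", "")
        if any(token in ua for token in BLOCKED_UAS):
            abort(403)  # a bot that lies about its UA sails straight past this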
How does "proper UA string" solve this "blowing up websites" problem
The only thing that matters with respect to the "blowing up websites" problem is rate-limiting, i.e., behaviour
"Shitty crawlers" are a nuisance because of their behaviour, i.e., request rate, not because of whatever UA string they send; the behaviour is what is "shitty" not the UA string. The two are not necessarily correlated and any heuristic that naively assumes so is inviting failure
"Spoofed" UA strings have been facilitated and expected since the earliest web browsers
For example,
https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...
To borrow the parent's phrasing, the "blowing up websites" problem has nothing to do with UA string specifically
It may have something to do with website operator reluctance to set up rate-limiting though; this despite widespread implementation of "web APIs" that use rate-limiting
NB. I'm not suggesting rate-limiting is a silver bullet. I'm suggesting that without rate-limiting, UA string as a means of addressing the "blowing up websites" problem is inviting failure
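To make "behaviour, not UA string" concrete, here is a minimal per-IP token-bucket sketch (rates are illustrative; in practice this would usually live at the proxy or CDN layer rather than in application code):

    import time
    from collections import defaultdict

    RATE = 4.0    # tokens (requests) refilled per second, illustrative
    BURST = 20.0  # maximum bucket size, illustrative

    buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

    def allow(ip: str) -> bool:
        """Return True if this request is within the per-IP rate limit."""
        b = buckets[ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
        b["ts"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False  # caller should respond 429, regardless of UA string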
How would UA string help
For example, a crawler making "strange" requests can send _any_ UA string, and a crawler doing "normal" requests can also send _any_ UA string.
The "doing requests" is what I refer to as "behaviour"
A website operator might think "Crawlers making strange requests send UA string X but not Y"
Let's assume the "strange" requests cause a "website load" problem^1
Then a crawler, or any www user, makes a "normal" request and sends UA string X; the operator blocks or redirects the request, unnecessarily
Then a crawler makes "strange" request and sends UA string Y; the operator allows the request and the website "blows up"
What matters for the "blowing up websites" problem^1 is behaviour, not UA string
1. The article's title calls it the "blowing up websites" problem, but the article text calls it a problem with "website load". As always the details are missing. For example, what is the "load" at issue. Is it TCP connections or HTTP requests. What number of simultaneous connections and/or requests per second is acceptable, and what number is not. Again, behaviour is the issue, not UA string
The acceptable numbers need to be published; for example, see documentation for "web APIs"
Unless the rate is exceeded, the limit is not being avoided
"I regularly see millions of unique ips doing strange requests, each just one or at most a few per day."
Assuming the rate limit is more than one or a few requests every 24h this would be complying with the limit, not avoiding it
It could be that sometimes the problem website operators are concerned about is not "website load", i.e., the problem the article is discussing, it is actually something else (NB. I am not speculating about this particular operator, I am making a general observation)
If a website is able to fulfill all requests from unique IPs without affecting quality of service, then it stands to reason "website load" is not a problem the website operator is having
For example, the article's title claims Meta is amongst the "worst offenders" of creating excessive website load caused by "AI crawlers, fetchers"
Meta has been shown to have used third party proxy services with rotating IP addresses in order to scrape other websites; it also sued one of these services because it was being used to scrape Meta's website, Facebook
https://brightdata.com/blog/general/meta-dismisses-claim-aga...
Whether the problem that Meta was having with this "scraping" was "website load" is debatable; if the requests were being fulfilled without affecting QoS, then arguably "website load" was not a problem
Rate-limiting addresses the problem of website load; it allows website operators to ensure that requests from all IP addresses are adequately served as opposed to preferentially servicing some IP addresses to the detriment of others (degraded QoS)
Perhaps some website operators become concerned that many unique IP addresses may be under the control of a single entity, and that this entity may be a competitor; this could be a problem for them
But if their website is able to fulfill all the requests it receives without degrading QoS then arguably "website load" is not a problem they are having
NB. I am not suggesting that a high volume of requests from a single entity, each complying with a rate-limit is acceptable, nor am I making any comment about the practice of "scraping" for commercial gain. I am only commenting about what rate-limiting is designed to do and whether it works for that purpose
The same incentives to do this already existed for search engine operators.
And as for denying service and preventing human people from visiting websites: Cloudflare does more of that damage in a single day than all these "AI" associated corporations and their crappy crawlers have in years.
This is corporations damaging things because of AI. Corporations will damage things for other reasons too but the only reason they are breaking the internet in this way, at this time, is because of AI.
I think the "AI doesn't kill websites, corporations kill websites" argument is as flawed as the "Guns don't kill people, people kill people" argument.
The problem in this case is the near complete protection from legal liability that corporate structures give to the people behaving badly. Like how Coca Cola can get away with killing people (https://prospect.org/features/coca-cola-killings/) but a person can't, if you want to keep the firearms analogy going. But it's a bad analogy because the firearms as tool actually at least are involved in the bad (and good) actions. AI itself isn't even involved in the HTTP requests and probably isn't even running on the same premises.
> This isn't AI damaging anything. This is corporations damaging things
This is the guns don't kill people, people kill people argument. The problem with crawlers is about 10x worse than it was previously because of AI and their hunger for data.
Or to use a common HN aphorism “your business model is not my problem”. Disconnect from me if you don’t want my traffic.
You want to look at one of our git commits? Sure! That's what our web-fronted git repo is for. Go right ahead! Be our guest!
Oh ... I see. You want to download every commit in our repository. One by one, when you could have just used git clone. Hmm, yeah, I don't want your traffic.
But wait, "your traffic" seems to originate from ... consults fail2ban logs ... more than 900k different IP addresses, so "disconnecting" from you is non-trivial.
I can't put it more politely than this: fuck off. Do not pass go. Do not collect stock options. Go to hell, and stay there.
If serving traffic for free is a problem, don't. If you are only able to serve N requests per second/minute/day/etc, do that. But don't complain if you give out something for free and people take it.
(also, a lot of the numbers people quote during these AI scraper "attacks" are very tame and the fact they are branded as problematic makes me suspect there's substantial incompetence in the solutions deployed to serve them)
These scrapers have upped both the server load (requests per second) and bandwidth requirements, without me consenting to it. If they were actual human users OR bots that were appropriately designed to minimize their impact on the target sites, that's perfectly OK.
Maybe if this were truly the only way to get our god-like LLMs to work in a god-like way (*), it would also be acceptable. But it isn't.
And on top of that, they are incompetently designed and they are causing real issues that a huge number of sites need to address.
(*) put differently, if all this current scraping activity delivered some notable benefit to humanity
What’s the difference between giving 900K meals to one person and feeding 900K people? The former is being abusive, wasteful, and depriving almost 900K other people of food. They are also being deceitful by pretending to be 900K different people.
Resources are finite. Web requests aren’t food, but you still pay for them. A spike in traffic may mean your service being down for the rest of the month, which is more acceptable if you helped a bunch of people who have now learned about and can talk about and share what you provided, versus having wasted all your traffic on a single bad actor who didn’t even care because they were just a robot.
> makes me suspect there's substantial incompetence in the solutions deployed to serve them
So you see bots scraping the Wikipedia webpages instead of downloading their organised dump, or scraping every git service webpage instead of cloning a repo, and think the incompetence is with the website instead of the scraper wasting time and resources to do a worse job?
IP address (presumably after too many visits)? So now the iptables mechanism has to scale to fit your business model (of hammering my git repository 1 commit at a time from nearly a million IP addresses)? Why does the code I use have to fit your braindead model? We wouldn't care if you just used git clone, but you're too dumb to do that.
The URL? Legitimate human (or other) users won't be happy about that.
Our web-fronted git repo is not part of our business model. It's just a free service we like to offer people, unrelated to revenue flow or business operations. So your behavior is not screwing my business model, but it is screwing up people who for whatever reason want to use that service, who can no longer use the web-fronted git repo.
ps. I've used "you" throughout the above because you used "my". No idea if you personally are involved in any such behavior.
Anytime somebody writes "just" you can immediately tell that they have no idea what they are talking about.
I think the ISPs serving these requests are probably going to have to start going after customers for being abusive in order for this to stop.
Such is life.
The problem is precisely that that is not possible. It is very well known that these scrapers aren’t respecting the wishes of website owners and even circumvent blocks any way they can. If these companies respected the website owners’ desires for them to disconnect, we wouldn’t be having this conversation.
People send me spam. I don't whine about it. I block it.
Obviously I’m talking about the people behind them, and I very much doubt you lack the minimal mental acuity to understand that when I used “website owners” in the preceding sentence. If you don’t want to engage in a good faith discussion you can just say so, no need to waste our time with fake pedantry. But alright, I edited that section.
> You can set your machine to blackhole the traffic or TCP RST or whatever you want. It's just network traffic.
And then you spend all your time in a game of cat and mouse, while these scrapers bring your website down and cost you huge amounts of money. Are you incapable of understanding how that is a problem?
> People send me spam. I don't whine about it. I block it.
Is the amount of spam you get so overwhelming that it swamps your inbox every day to a level you’re unable to find the real messages? Do those spammers routinely circumvent your rules and filters after you’ve blocked them? Is every spam message you get costing you money? Are they increasing every day? No? Then it’s not the same thing at all.
10/10. No notes.
I wish there were a better way to solve this.
I have to make sure legit bots don't get hit, as a huge percent of our traffic which helps the project stay active is from google, etc.
https://github.com/pinballmap/pbm/blob/302ac638850711878ac61...
https://github.com/pinballmap/pbm/blob/302ac638850711878ac61...
If they don't respect the hidden link and robots.txt, they get banned, but only for 3 hours.
After that it's the permanent iptables rule, but it could be a CF API call as well.
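A rough sketch of the general shape of that trap (hypothetical paths and timings, not the linked project's actual code): robots.txt disallows a path that is only reachable via a hidden link, so anything requesting it has ignored robots.txt and gets a temporary ban.

    # Honeypot-ban sketch (illustrative only). /secret-trap/ is disallowed in
    # robots.txt and only linked invisibly, so whoever requests it is a bot
    # that ignored robots.txt.
    import time

    BAN_SECONDS = 3 * 3600   # 3-hour temporary ban, as described above
    banned = {}              # ip -> expiry timestamp

    def is_banned(ip: str) -> bool:
        expiry = banned.get(ip)
        if expiry and time.time() < expiry:
            return True
        banned.pop(ip, None)
        return False

    def handle_request(ip: str, path: str) -> int:
        if is_banned(ip):
            return 403
        if path.startswith("/secret-trap/"):
            banned[ip] = time.time() + BAN_SECONDS
            return 403       # repeat offenders could go to iptables or a CF rule
        return 200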
It's much more likely that their crawler is just garbage and got stuck into some kind of loop requesting my domain.
Even Googlebot has to be told to not crawl particular querystrings, but the AI crawlers are worse.
I run a very small browser game (~120 weekly users currently), and until I put its Wiki (utterly uninteresting to anyone who doesn't already play the game) behind a login-wall, the bots were causing massive amounts of spurious traffic. Due to some of the Wiki's data coming live from the game through external data feeds, the deluge of bots actually managed to crash the game several times, necessitating a restart of the MariaDB process.
The data transfer, on the other hand, could be substantial and costly. Is it known whether these crawlers do respect caching at all? Provide If-Modified-Since/If-None-Match or anything like that?
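For reference, respecting caching is cheap on the client side. A hedged sketch of what a polite fetcher could do with standard validators (using the requests library; the cache structure is just an illustration):

    # Sketch of a polite conditional fetch: send back the validators from the
    # previous response and skip the transfer on 304 Not Modified.
    import requests

    def fetch(url: str, cache: dict) -> bytes:
        headers = {}
        prev = cache.get(url)
        if prev:
            if prev.get("etag"):
                headers["If-None-Match"] = prev["etag"]
            if prev.get("last_modified"):
                headers["If-Modified-Since"] = prev["last_modified"]
        r = requests.get(url, headers=headers, timeout=30)
        if r.status_code == 304 and prev:
            return prev["body"]  # unchanged; no body transferred
        cache[url] = {
            "etag": r.headers.get("ETag"),
            "last_modified": r.headers.get("Last-Modified"),
            "body": r.content,
        }
        return r.content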
e.g. they'll take an entire sentence the user said and put it in quotes for no reason.
Thankfully search engines started ignoring quotes years ago, so it balances out...
We've done a few things since then:
- We already had very generous rate limiting rules by IP (~4 hits/second sustained) but some of the crawlers used thousands of IPs. Cloudflare has a list that they update of AI crawler bots (https://developers.cloudflare.com/bots/additional-configurat...). We're using this list to block these bots and any new bots that get added to the list.
- We have more aggressive rate limiting rules by ASN on common hosting providers (e.g. AWS, GCP, Azure) which also hit a lot of these bots.
- We are considering using the AI crawler list to rate limit by user agent in addition to rate limiting by IP. This will allow well behaved AI crawlers while blocking the badly behaved ones. We aren't against the crawlers generally.
- We now have alert rules that alert us when we get a certain amount of traffic (~50k uncached reqs/min sustained). This is basically always some new bot cranked to the max and usually an AI crawler. We get this ~monthly or so and we just ban them.
Auto-scaling made our infra good enough where we don't even notice big traffic spikes. However, the downside of that is that the AI crawlers were hammering us without causing anything noticeable. Being smart with rate limiting helps a lot.
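A hedged sketch of that alert rule (threshold matches the ~50k uncached req/min mentioned above; in practice this would live in Cloudflare or metrics tooling rather than application code):

    # Alert when the uncached request rate stays above a threshold for several
    # consecutive minutes (numbers illustrative).
    THRESHOLD = 50_000      # uncached requests per minute
    SUSTAINED_MINUTES = 5

    def should_alert(uncached_per_minute: list[int]) -> bool:
        recent = uncached_per_minute[-SUSTAINED_MINUTES:]
        return len(recent) == SUSTAINED_MINUTES and all(
            m >= THRESHOLD for m in recent
        )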
Anubis's creator says the same thing:
> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.
What's wrong with crawlers? That's how google finds you, and people find you on google.
Just put some sensible request limits per hour per IP, and be done.
I have no personal experience, but probably worth reading like... any of the comments where people are complaining about these crawlers.
Claims are that they're: ignoring robots.txt; sending fake User-Agent headers; they're crawling from multiple IPs; when blocked they will use residential proxies.
People who have deployed Anubis to try and address this include: Linux Kernel Mailing List, FreeBSD, Arch Linux, NixOS, Proxmox, Gnome, Wine, FFMPEG, FreeDesktop, Gitea, Marginalia, FreeCAD, ReactOS, Duke University, The United Nations (UNESCO)...
I'm relatively certain if this were as simple as "just set a sensible rate limit and the crawlers will stop DDOS'ing your site" one person at one of these organizations would have figured that out by now. I don't think they're all doing it because they really love anime catgirls.
To stop malicious bots like this, Cloudflare is a great solution if you don't mind using it (you can enable a basic browser check for all users and all pages, or write custom rules to only serve a check to certain users or on certain pages). If you're not a fan of Cloudflare, Anubis works well enough for now if you don't mind the branding.
Here's the cloudflare rule I currently use (vast majority of bot traffic originates from these countries):
ip.src.continent in {"AF" "SA"} or
ip.src.country in {"CN" "HK" "SG"} or
ip.src.country in {"AE" "AO" "AR" "AZ" "BD" "BR" "CL" "CO" "DZ" "EC" "EG" "ET" "ID" "IL" "IN" "IQ" "JM" "JO" "KE" "KZ" "LB" "MA" "MX" "NP" "OM" "PE" "PK" "PS" "PY" "SA" "TN" "TR" "TT" "UA" "UY" "UZ" "VE" "VN" "ZA"} or
ip.src.asnum in {28573 45899 55836}
It also sounds like there is an opportunity to sell scraped data to these companies. Instead of 10 crawlers we get one crawler and they just resell/give it away. More honeypots don't really fix the root cause (which is greed).
429 Too Many Requests
> This is a you problem not a me problem
That's the "4" in "429"
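A minimal sketch of saying exactly that to the client (Flask assumed; the Retry-After value and the rate check are placeholders): a 429 plus Retry-After tells well-behaved clients what rate is acceptable.

    # Sketch: 429 plus a Retry-After header states the limit explicitly
    # (values illustrative; over_limit() is a placeholder).
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/limited-resource")
    def limited_resource():
        if over_limit():  # hypothetical per-IP check, e.g. a token bucket
            resp = jsonify(error="too many requests")
            resp.status_code = 429
            resp.headers["Retry-After"] = "60"  # come back in a minute
            return resp
        return jsonify(ok=True)

    def over_limit() -> bool:
        return False      # placeholder; a real check would track per-IP rates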
As I said, you can just enable that for everyone and be done with it, but with a custom rule, you can avoid showing it to people that are unlikely to be bots.
Typically by hiring them... although their provenance might be sketchy.
eg: https://www.cybersecuritydive.com/news/us-charges-oregon-man...
^ Oregon man arrested for running ~70,000 device DDOS-for-hire botnet; the distributed attacking computers were mostly compromised IoT gadgets - fridges, routers, toasters, doorbells, etc. with weak security that were probably scanned and p0wned via Shodan (or a similar device mapping project).
More legally, there are many "free software" deals that offer services via installed software that comes with a side order of background web crawling buried in the fine print of the WALL-O'-TEXT Terms of Use agreement.
Enterprising middle people gather up bots and offer them for hire to web crawlers; large-scale companies will farm their own bots via their existing user base.
There are many ways. At their scale they (Meta at least) probably have edge servers in ISPs across the planet and can easily mix their crawlers with residential IP addresses rotationally assigned by the domestic ISPs they co-mingle with.
Or some other way, legal but somewhat obfuscated.
Is the reason these large companies don't care because they are large enough to hide behind a bunch of lawyers?
Maybe some civil lawsuit about terms of service? You'd have to prove that the scraper agreed to the terms of service. Perhaps in the future all CAPTCHAs come with a TOS click-through agreement? Or perhaps every free site will have a login wall?
Maybe they are paying big bucks for people who are actually very bad at their jobs?
Why would the CEOs tolerate that? Do they think it's a profitable/strategic thing to get away with, rather than a sign of incompetence?
When subtrees of the org chart don't care that they are very bad at their jobs, harmed parties might have to sue to get the company to stop.
> "I don't know what this actually gives people, but our industry takes great pride in doing this"
> "unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance that can produce output that superficially resembles the output of human employees"
> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."
<3 <3
AI crawlers were downloading whole pages and executing all the javascript tens of millions of times a day - hurting performance, filling logs, skewing analytics and costing too much money in Google Maps loads.
Really disappointing.
[1] https://developers.cloudflare.com/cloudflare-challenges/
Any bot that answers daily political questions like Grok has many web accesses per prompt.
The reason is that user requests are similar to other web traffic because they reflect user interest. So those requests will mostly hit content that is already popular, and therefore well-cached.
Corpus-building crawlers do not reflect current user interest and try to hit every URL available. As a result these hit URLs that are mostly uncached. That is a much heavier load.
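A back-of-envelope illustration (made-up numbers) of why the same request volume from a corpus crawler lands much harder on the origin:

    # Illustrative arithmetic only: same request volume, very different origin load.
    requests_per_day = 1_000_000

    user_cache_hit_rate = 0.95     # user traffic concentrates on popular, cached URLs
    crawler_cache_hit_rate = 0.05  # corpus crawls sweep the long tail, mostly uncached

    user_origin_hits = requests_per_day * (1 - user_cache_hit_rate)        # 50,000
    crawler_origin_hits = requests_per_day * (1 - crawler_cache_hit_rate)  # 950,000

    print(crawler_origin_hits / user_origin_hits)  # ~19x the origin load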
Of course they are crawling every day to improve their training data. The goal is LLMs that know everything, but “everything” changes on a daily basis.
Meta and OpenAI are simply the largest after Google, but Google has had ~20 more years to learn how to politely operate crawlers at full-Internet scale.
I'm not sure how many websites are searched and discarded per query. Since it's the remote, proprietary LLM that initiates the search I would hesitate to call them agents. Maybe "fetcher" is the best term.
So does my browser when I have uBlock Origin enabled.
They're going out and scraping everything, so that when they're asked a question, they can pull a plausible answer from their dataset and summarize the page they found it on.
Even the ones that actively go out and search/scrape in response to queries aren't just scraping a single site. At best, they're scraping some subset of the entire internet that they have tagged as being somehow related to the query. So even if what they present to the user is a summary of a single webpage, that is rarely going to be the product of a single request to that single webpage. That request is going to be just one of many, most of which are entirely fruitless for that specific query: purely extra load for their servers, with no gain whatsoever.
As long as I have an EULA or a robots.txt or even a banner that forbids this sort of access, shouldn't any computerized access be considered abuse? Something, something, scraping JSTOR?
If all you're concerned about is server load, wouldn't it be better to just offer a tar file containing all of your pages that they can download instead? The models are months out of date, so a monthly dump would surely satisfy them. There could even be some coordination for this.
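A hedged sketch of what such a dump could look like on the publisher's side (paths are hypothetical; projects like Wikipedia already publish dumps in their own formats):

    # Sketch: bundle the rendered static pages into a dated tarball once a month,
    # so bulk consumers have a cheaper option than crawling page by page.
    import os
    import tarfile
    from datetime import date

    def build_monthly_dump(site_root: str = "public/",
                           out_dir: str = "dumps/") -> str:
        os.makedirs(out_dir, exist_ok=True)
        name = f"{out_dir}site-dump-{date.today():%Y-%m}.tar.gz"
        with tarfile.open(name, "w:gz") as tar:
            tar.add(site_root, arcname="site")
        return name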
They're going to crawl anyway. We can either cooperate or turn it into some weird dark market with bad externalities like drugs.
The bot operators DNGAF.
There's no way the people behind that bot are going to follow any suggestions to make it behave better. After all, adding things like caching and rate-limiting to your web crawler might take a few hours, and who's got time for that.
If the crawlers were aware of these archive files and willing to use them, that would help, but they aren't. (It would also help to know which dynamic files are worthless for archiving and mirroring, but they will often ignore that.)
I just set a rate-limit in cloudflare because no legitimate symbol server user will ever be excessive.
Web crawlers have been around for years, but many of the current ones are more indiscriminate and less well behaved.
But now you are "on the cloud", with lambdas because "who cares", and hiring a proper part-time sysadmin is too complicated, so now you are pounded with crazy costs for moderate loads...
All this attempting to block AI scrapers would not be an issue if they respected rate limits, knew how to back off when a server starts responding too slowly, or cached frequently visited sites. Instead some of these companies will do everything, including using residential ISPs, to ensure that they can just piledrive the website of some poor dude that's just really into lawnmowers, or the git repo of some open source developer who just wants to share their work.
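For comparison, the polite behaviour being asked for is not complicated. A minimal sketch of backing off when the origin slows down or pushes back (thresholds illustrative, using the requests library):

    # Sketch of a crawler that backs off when the origin is struggling:
    # slow responses or 429/503 double the delay, healthy responses shrink it.
    import time
    import requests

    def polite_get(url: str, delay: float = 1.0,
                   max_delay: float = 300.0) -> tuple[requests.Response, float]:
        start = time.monotonic()
        resp = requests.get(url, timeout=60)
        elapsed = time.monotonic() - start
        if resp.status_code in (429, 503) or elapsed > 5.0:
            delay = min(max_delay, delay * 2)  # server is struggling: back off
        else:
            delay = max(1.0, delay / 2)        # healthy: gently speed back up
        time.sleep(delay)
        return resp, delay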
Very few would actually be against AI crawlers if they showed just the tiniest amount of respect, but they don't. I think Drew DeVault said it best: "Please stop externalizing your costs directly into my face"
Sites will have to either shutdown or move behind a protection racket run by one of the evil megacorps. And TBH, shutting down is the better option.
With clickthrough traffic dead, what's even the point of putting anything online? To feed AIs so that someone else can profit at my (very literal) expense? No thanks. The knowledge dies with me.
The internet dark age is here. Everyone, retreat to your fiefdom.