Amazonbot is finally respecting robots.txt
139 points
by xena
5 hours ago
| 9 comments
| xeiaso.net
phdelightful
3 hours ago
[-]
I just put Anubis in front of my self-hosted forge this morning because AmazonBot had helped itself to 750 GiB (!) of traffic to my public repos this month!

At least, it claimed to be AmazonBot…

reply
Bender
2 hours ago
[-]
Are they in this IP space? [1] One could map the ranges into a web daemon and rate-limit them, or just 'ip route add blackhole ${cidr}' each CIDR block.

[1] - https://ip-ranges.amazonaws.com/ip-ranges.json
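A minimal sketch of that blackhole approach. A tiny inline sample stands in for the real feed (in practice you would fetch https://ip-ranges.amazonaws.com/ip-ranges.json); the prefixes here are illustrative, and the commands are printed rather than executed, since 'ip route add' needs root:

```python
# Turn AWS's published IP ranges into blackhole routes (dry run).
import json

# Inline stand-in for ip-ranges.json; the real feed has the same shape.
sample = """{"prefixes": [
  {"ip_prefix": "3.2.34.0/26", "region": "af-south-1", "service": "AMAZON"},
  {"ip_prefix": "13.34.37.64/27", "region": "ap-southeast-4", "service": "EC2"}
]}"""

data = json.loads(sample)
commands = [
    f'ip route add blackhole {p["ip_prefix"]}'
    for p in data["prefixes"]
    if p["service"] == "AMAZON"  # the broad "AMAZON" ranges cover the crawler
]
print("\n".join(commands))
```

Drop the print-only step and run each command as root to actually null-route the ranges; just be aware the "AMAZON" service entries are broad and will also cut off legitimate AWS-hosted traffic.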

reply
nathanmills
3 hours ago
[-]
Do you have a robots.txt?
reply
xena
3 hours ago
[-]
> We are writing to inform you that starting Monday, June 15, 2026, crawl preferences for Amazonbot will be managed solely through the industry-standard directives.

They will in the future, but not today.

reply
jacobn
4 hours ago
[-]
I just complained to them the other day! They were scraping our weather website to no end, very much including the disallowed path prefixes.

We did end up just adding them to our WAF blocklist, which is weirdly ironic: hosting on their infra & using their services to block their AI scraper...

reply
BLKNSLVR
4 hours ago
[-]
I hope you leave it on the WAF. If they're only just deciding to respect robots.txt, which has been internet infrastructure forever, then it's probably still incredibly amateur software with 'Amazon-priorities' rather than 'responsible internet traffic' priorities.
reply
adrianvi
2 hours ago
[-]
step 1: create the problem, step 2: sell the solution, step 3: profit
reply
bstsb
4 hours ago
[-]
> Get Outlook for Mac

this bit made me laugh. was the email drafted in Outlook? was it sent to some sort of forwarding mailbox, or did they just BCC every customer in?

reply
captn3m0
3 hours ago
[-]
Good place to ask: I saw a new AWS user agent in my logs today: Amazon-Quick-on-Behalf-of-$HEXID

I found a mention on some user-agent trackers but no official documentation. Does anyone know if it's documented? Asking because I'm seeing decent traffic (30 GB/week) from it.
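If anyone wants to quantify this kind of traffic themselves, here's a rough sketch that tallies bytes served per user agent from an access log in the common nginx/Apache "combined" format. The sample lines, IPs, and UA suffix below are made up for illustration:

```python
# Sum response bytes per user agent from "combined"-format access log lines.
import re
from collections import defaultdict

# combined format: ip ident user [time] "request" status bytes "referer" "ua"
LINE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d+ (\d+) "[^"]*" "([^"]*)"')

sample_log = [
    '203.0.113.7 - - [20/Dec/2025:10:00:00 +0000] "GET /repo HTTP/1.1" 200 5120 "-" "aws-quick-on-behalf-of-123e4567"',
    '203.0.113.8 - - [20/Dec/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

bytes_by_ua = defaultdict(int)
for line in sample_log:
    m = LINE.match(line)
    if m:
        bytes_by_ua[m.group(2)] += int(m.group(1))

# Biggest consumers first.
for ua, n in sorted(bytes_by_ua.items(), key=lambda kv: -kv[1]):
    print(f"{n:>10}  {ua}")
```

Point it at your real log (and loosen the regex if your log format differs) to see exactly how much of that 30 GB is this one agent.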

reply
embedding-shape
3 hours ago
[-]
Came across this recently too; it seems to be from "Amazon Quick", where crawling others' websites is basically a feature of the product: https://docs.aws.amazon.com/quick/latest/userguide/web-crawl...

> Crawling behavior [...] Crawler identification: Identifies itself with user-agent string "aws-quick-on-behalf-of-<UUID>" in request headers.

Maybe people found a way of using it as a loophole for something or Amazon Quick is just picking up in usage, and your website is popular amongst whoever uses that sort of stuff.

reply
iLoveOncall
3 hours ago
[-]
Amazon Quick is the new name of Quicksight, which is the BI tool from AWS.

It has AI agents included so I guess this can just come from it searching the web based on user requests.

reply
TurdF3rguson
4 hours ago
[-]
Why does Amazonbot even exist, can someone explain? I don't understand why an ecommerce play would be crawling other websites.
reply
input_sh
4 hours ago
[-]
To train AI. That's not even hyperbole; it is the only concrete example they list in their explanation: https://developer.amazon.com/amazonbot

> Amazonbot is used to improve our products and services. This helps us provide more accurate information to customers and may be used to train Amazon AI models.

reply
tintor
4 hours ago
[-]
To ensure Amazon marketplace sellers aren't offering lower prices on other ecommerce websites. Also AI.
reply
b112
3 hours ago
[-]
I was wondering about this too. It makes me think the whole thing is untrue, unless they plan to drop this pricing tactic.

They've been getting some heat on it lately, but I find it hard to believe they're going to give it up entirely. And if so, what's to stop a seller from flouting their pricing rules and then using robots.txt to keep the crawler out?

reply
embedding-shape
4 hours ago
[-]
Amazonbot is specifically the user agent they use when crawling to "provide more accurate information to customers" (whatever that means; it sounds like it could be anything) and also when they scrape data for AI training, according to https://developer.amazon.com/amazonbot
reply
reaperducer
4 hours ago
[-]
AI. Gotta slurp the world.
reply
TrackerFF
1 hour ago
[-]
Is it just me, or is it extra unethical and self-serving when crawlers from, say, Amazon(bot) decide to incessantly crawl AWS-hosted websites? The same goes for Google and Microsoft crawlers crawling GCP and Azure.

By that, I mean the kinds of crawls that can run up significant usage charges.

reply
arjie
4 hours ago
[-]
Huh, I get a lot of traffic from Amazonbot (relative to humans) and, try as I might, I can't keep it out of an accidental tarpit: it sits there blasting every variation of my recent pages, because MediaWiki lists so many internal links. I have them appropriately marked nofollow and warn the bot not to waste its time via robots.txt, but it still wedges itself into nonsense internal pages.

The traffic isn't a problem. I've got Cloudflare in front and the machine itself is relatively overpowered, and downtime isn't critical. But I'd just like the thing to be able to spider me properly. Someone did point out to me that maybe I wasn't receiving actual Amazonbot but some other spider: https://news.ycombinator.com/item?id=46352723
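(For reference, the directives in question are just a couple of lines. The path below is a hypothetical example of a MediaWiki internal namespace, not taken from the site above:)

```
User-agent: Amazonbot
Disallow: /wiki/Special:
```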

reply
namegulf
4 hours ago
[-]
Robots.txt is lame, BTW; there is no way to enforce it. It is up to the bot to decide whether to crawl, and in most cases they don't care.

Cloudflare has a nice technique to address the bot problem (if you use their name servers): it respects and enforces your robots.txt while sending the remaining bots into a deep black hole.

reply
input_sh
4 hours ago
[-]
Yes, we know, its purpose is to guide the bots, not forcibly block them.

That said, one of the biggest websites in the world not respecting it is definitely a noteworthy story. Hopefully another one of the biggest websites in the world (formerly known as Twitter) eventually respects it as well instead of not even disclosing itself via a user agent and pretending to be Safari running on iOS.

reply
namegulf
3 hours ago
[-]
Why downvote a comment?

You're talking about one bot (yes, the biggest), but the millions of other bots that don't follow it must be a bigger story.

reply
marginalia_nu
2 hours ago
[-]
Robots.txt is great if you're trying to run an above board operation. Much easier than trying to guess how a webmaster wishes the crawler to behave, and then getting angry emails when you guess wrong.
reply
llbbdd
3 hours ago
[-]
Yeah, robots.txt is a prime example of the kind of solution invented by people who don't understand incentives whatsoever.
reply
vindin
2 hours ago
[-]
robots.txt is merely a gentleman’s courtesy at this point. Nobody is obligated to follow it.
reply
c-hendricks
2 hours ago
[-]
always_has_been.jpg
reply