We crawled 1M domains to map AI agent permissions – 90% have no policy
2 points | 1 hour ago | 2 comments | maango.io | HN
mehula
1 hour ago
Hey HN - I built this.

I'm building infrastructure for AI agents and kept running into the same problem: before an agent fetches a URL, there's no easy way to know what's allowed. There are now 8 different standards - robots.txt, llms.txt, ai.txt, TDMRep, Cloudflare Content Signals, and others - all saying different things in different formats. No one checks all of them. Most agents check zero.
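For the robots.txt piece specifically, Python's standard library already covers the parsing. A minimal sketch of the per-agent check (the robots.txt content and URL here are hypothetical, for illustration only):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot entirely, plus one of those
# decade-old generic rules for everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

The catch is that this only covers one of the eight signals; llms.txt, TDMRep, Content Signals, and the rest each need their own parser, which is exactly why most agents check nothing.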

So I decided to actually measure the problem. I crawled the Tranco top 1M domains over 10 days in February 2026, parsing every known AI policy signal. Failure rate was 0.07% (697 domains out of 1M).

What surprised me most:

- 90% of domains have zero AI-specific signals. Not "they block everything" - they literally say nothing. Most robots.txt files just have generic /admin/ or /wp-login/ rules from a decade ago.

- When sites DO block, it's almost always a blanket decision. 58,791 domains block both GPTBot and ClaudeBot. Only 9,888 block GPTBot alone. The "nuanced policy" that regulators imagine basically doesn't exist.

- Cloudflare sites block AI at 2.3x the baseline rate. Not because their owners care more - because Cloudflare shipped a one-click toggle in July 2024. The tooling creates the behavior.

- TDMRep adoption: 37 out of 1 million. That's the W3C protocol specifically designed for the EU Copyright Directive's TDM opt-out. Caveat: our detection covers the well-known path and HTTP headers, not HTML meta tags on subpages – actual adoption among European publishers is likely higher. We note this in the methodology.

- The ToS gap is the finding I think matters most. We scanned 79K Terms of Service pages. 7,575 domains prohibit crawling or AI training in their ToS but have zero AI-specific robots.txt rules. YouTube, Discord, Substack, Target - an agent checking only robots.txt sees "no policy" while the site's legal terms explicitly say stop.

- 6,317 domains contradict themselves across standards - e.g., blocking GPTBot in robots.txt but setting search=yes in Content Signals.
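Detecting that last kind of contradiction is mechanical once both signals are parsed. A rough sketch, assuming the Content Signals directive appears as a `Content-Signal: key=value, ...` line inside robots.txt (per Cloudflare's published format; treat the exact syntax as an assumption) and using a deliberately simplified robots.txt group parser:

```python
def parse_signals(robots_txt: str) -> dict:
    """Collect key=value pairs from `Content-Signal:` lines (assumed format)."""
    signals = {}
    for line in robots_txt.splitlines():
        if line.strip().lower().startswith("content-signal:"):
            for pair in line.split(":", 1)[1].split(","):
                if "=" in pair:
                    key, value = pair.split("=", 1)
                    signals[key.strip().lower()] = value.strip().lower()
    return signals

def gptbot_blocked(robots_txt: str) -> bool:
    """Very rough check: does any user-agent group naming GPTBot contain `Disallow: /`?"""
    agents = []
    for line in robots_txt.splitlines():
        low = line.strip().lower()
        if low.startswith("user-agent:"):
            agents.append(low.split(":", 1)[1].strip())
        elif low.startswith("disallow:"):
            if low.split(":", 1)[1].strip() == "/" and "gptbot" in agents:
                return True
        elif not low:
            agents = []  # blank line ends a group
    return False

# Hypothetical self-contradicting robots.txt, mirroring the pattern above.
robots = """\
User-agent: GPTBot
Disallow: /

Content-Signal: search=yes, ai-train=no
"""

if gptbot_blocked(robots) and parse_signals(robots).get("search") == "yes":
    print("contradiction: robots.txt blocks GPTBot but Content-Signal says search=yes")
```

A production check would have to decide which signal wins, which is a policy question, not a parsing one.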

This is the first public output from a project called Maango, which is building a registry and API to check any domain's AI policy across all 8 standards in one call. The report is free and the methodology is documented in full.

Happy to answer questions about the data, methodology, or the agent compliance space generally.

throwawayffffas
1 hour ago
I think most startups' policy is "We have professional indemnity insurance that covers our use of AI agents".
mehula
1 hour ago
Insurance covers the lawsuit; it doesn't un-scrape the content. Better to stay compliant from day one.