We crawled 1M domains to map AI agent permissions – 90% have no policy
2 points | 1 hour ago | 2 comments | maango.io | HN
mehula
1 hour ago
Hey HN - I built this.

I'm building infrastructure for AI agents and kept running into the same problem: before an agent fetches a URL, there's no easy way to know what's allowed. There are now 8 different standards - robots.txt, llms.txt, ai.txt, TDMRep, Cloudflare Content Signals, and others - all saying different things in different formats. No one checks all of them. Most agents check zero.
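For the robots.txt piece specifically, Python's standard library already covers the parsing. A minimal sketch of the per-agent check (the robots.txt content and URL here are hypothetical, for illustration only):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot entirely, plus one of those
# decade-old generic rules for everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

The catch is that this only covers one of the eight signals; llms.txt, TDMRep, Content Signals, and the rest each need their own parser, which is exactly why most agents check nothing.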

So I decided to actually measure the problem. I crawled the Tranco top 1M domains over 10 days in February 2026, parsing every known AI policy signal. Failure rate was 0.07% (697 domains out of 1M).

What surprised me most:

- 90% of domains have zero AI-specific signals. Not "they block everything" - they literally say nothing. Most robots.txt files just have generic /admin/ or /wp-login/ rules from a decade ago.

- When sites DO block, it's almost always a blanket decision. 58,791 domains block both GPTBot and ClaudeBot. Only 9,888 block GPTBot alone. The "nuanced policy" that regulators imagine basically doesn't exist.

- Cloudflare sites block AI at 2.3x the baseline rate. Not because their owners care more - because Cloudflare shipped a one-click toggle in July 2024. The tooling creates the behavior.

- TDMRep adoption: 37 out of 1 million. That's the W3C protocol specifically designed for the EU Copyright Directive's TDM opt-out. Caveat: our detection covers the well-known path and HTTP headers, not HTML meta tags on subpages – actual adoption among European publishers is likely higher. We note this in the methodology.

- The ToS gap is the finding I think matters most. We scanned 79K Terms of Service pages. 7,575 domains prohibit crawling or AI training in their ToS but have zero AI-specific robots.txt rules. YouTube, Discord, Substack, Target - an agent checking only robots.txt sees "no policy" while the site's legal terms explicitly say stop.

- 6,317 domains contradict themselves across standards - e.g., blocking GPTBot in robots.txt but setting search=yes in Content Signals.
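Detecting that last kind of contradiction is mechanical once both signals are parsed. A rough sketch, assuming the Content Signals directive appears as a `Content-Signal: key=value, ...` line inside robots.txt (per Cloudflare's published format; treat the exact syntax as an assumption) and using a deliberately simplified robots.txt group parser:

```python
def parse_signals(robots_txt: str) -> dict:
    """Collect key=value pairs from `Content-Signal:` lines (assumed format)."""
    signals = {}
    for line in robots_txt.splitlines():
        if line.strip().lower().startswith("content-signal:"):
            for pair in line.split(":", 1)[1].split(","):
                if "=" in pair:
                    key, value = pair.split("=", 1)
                    signals[key.strip().lower()] = value.strip().lower()
    return signals

def gptbot_blocked(robots_txt: str) -> bool:
    """Very rough check: does any user-agent group naming GPTBot contain `Disallow: /`?"""
    agents = []
    for line in robots_txt.splitlines():
        low = line.strip().lower()
        if low.startswith("user-agent:"):
            agents.append(low.split(":", 1)[1].strip())
        elif low.startswith("disallow:"):
            if low.split(":", 1)[1].strip() == "/" and "gptbot" in agents:
                return True
        elif not low:
            agents = []  # blank line ends a group
    return False

# Hypothetical self-contradicting robots.txt, mirroring the pattern above.
robots = """\
User-agent: GPTBot
Disallow: /

Content-Signal: search=yes, ai-train=no
"""

if gptbot_blocked(robots) and parse_signals(robots).get("search") == "yes":
    print("contradiction: robots.txt blocks GPTBot but Content-Signal says search=yes")
```

A production check would have to decide which signal wins, which is a policy question, not a parsing one.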

This is the first public output from a project called Maango, which is building a registry and API to check any domain's AI policy across all 8 standards in one call. The report is free and the methodology is documented in full.

Happy to answer questions about the data, methodology, or the agent compliance space generally.

throwawayffffas
1 hour ago
I think most startups' policy is "We have professional indemnity insurance that covers our use of AI agents".
mehula
1 hour ago
Insurance covers the lawsuit; it doesn't un-scrape the content. Better to stay compliant from day one.