I'm building infrastructure for AI agents and kept running into the same problem: before an agent fetches a URL, there's no easy way to know what's allowed. There are now 8 different standards - robots.txt, llms.txt, ai.txt, TDMRep, Cloudflare Content Signals, and others - all saying different things in different formats. No one checks all of them. Most agents check zero.
So I decided to actually measure the problem. I crawled the Tranco top 1M domains over 10 days in February 2026, parsing every known AI policy signal. The crawl failure rate was 0.07% (697 domains out of 1M).
What surprised me most:
- 90% of domains have zero AI-specific signals. Not "they block everything" - they literally say nothing. Most robots.txt files just have generic /admin/ or /wp-login/ rules from a decade ago.
- When sites DO block, it's almost always a blanket decision. 58,791 domains block both GPTBot and ClaudeBot. Only 9,888 block GPTBot alone. The "nuanced policy" that regulators imagine basically doesn't exist.
- Cloudflare sites block AI at 2.3x the baseline rate. Not because their owners care more - because Cloudflare shipped a one-click toggle in July 2024. The tooling creates the behavior.
- TDMRep adoption: 37 out of 1 million. That's the W3C Community Group protocol specifically designed for the EU Copyright Directive's TDM opt-out. Caveat: our detection covers the well-known path and HTTP headers, not HTML meta tags on subpages - actual adoption among European publishers is likely higher. We note this in the methodology.
- The ToS gap is the finding I think matters most. We scanned 79K Terms of Service pages. 7,575 domains prohibit crawling or AI training in their ToS but have zero AI-specific robots.txt rules. YouTube, Discord, Substack, Target - an agent checking only robots.txt sees "no policy" while the site's legal terms explicitly say stop.
- 6,317 domains contradict themselves across standards - e.g., blocking GPTBot in robots.txt but setting search=yes in Content Signals.
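For anyone who wants to poke at the robots.txt side of this themselves, here is a minimal Python sketch of the kind of check behind the blanket-vs-selective and contradiction bullets above. It is not the actual pipeline: the function names, the sample file, and the contradiction rule (GPTBot blocked while Content Signals say search=yes, mirroring the example above) are assumptions for illustration. It also assumes Content Signals show up as `Content-Signal:` lines inside robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The two crawlers named in the bullets above; extend as needed.
AI_AGENTS = ("GPTBot", "ClaudeBot")


def blocked_agents(robots_txt: str, agents=AI_AGENTS) -> set[str]:
    """Return the agents that are not allowed to fetch the site root.

    Note: this also counts a wildcard 'User-agent: *' + 'Disallow: /',
    so it is broader than a strictly AI-specific rule.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent for agent in agents if not parser.can_fetch(agent, "/")}


def content_signals(robots_txt: str) -> dict[str, str]:
    """Collect key=value pairs from 'Content-Signal:' lines (assumed format)."""
    signals: dict[str, str] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "content-signal":
            for item in value.split(","):
                name, eq, setting = item.partition("=")
                if eq:
                    signals[name.strip().lower()] = setting.strip().lower()
    return signals


def classify(robots_txt: str) -> dict:
    """Label a robots.txt as blanket / selective / none, plus a contradiction flag."""
    blocked = blocked_agents(robots_txt)
    if blocked == set(AI_AGENTS):
        stance = "blanket"
    elif blocked:
        stance = "selective"
    else:
        stance = "none"
    signals = content_signals(robots_txt)
    # Hypothetical rule, taken from the example in the post: GPTBot is
    # blocked while the Content Signal says search=yes.
    contradiction = "GPTBot" in blocked and signals.get("search") == "yes"
    return {
        "stance": stance,
        "blocked": sorted(blocked),
        "signals": signals,
        "contradiction": contradiction,
    }


SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
"""

if __name__ == "__main__":
    print(classify(SAMPLE))
```

A real checker would also have to fetch the file, handle redirects and non-200 responses, and cover the other standards (ToS text, TDMRep, and so on). None of that is attempted here.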
This is the first public output from a project called Maango, which is building a registry and API to check any domain's AI policy across all 8 standards in one call. The report is free and the methodology is documented in full.
Happy to answer questions about the data, methodology, or the agent compliance space generally.