> URL: <https://...docs...> What parameters does the Create Stream endpoint accept?
The answer I would give is `name`, `description`, `retention_days`, and `tags`. What the answer sheet at <https://agentreadingtest.com/answers.json> lists instead is `CANARY-TRUNC-10K-fox` ("Early in the page. All agents should find this."), `CANARY-TRUNC-40K-river`, `CANARY-TRUNC-75K-summit`, etc. These strings do appear on the page, but why would the LLM's output include them? The first appears before the API endpoint subpath is even specified, and the second in the middle of a word in the description. They don't answer the test question of which parameters are supported.
A later test checks whether it can deal with broken pages (specifically, "an unclosed ``` fence"). If an agent can deal with seemingly erroneous strings on the page, wouldn't it refrain from echoing those tokens?
How is this test supposed to work?
Industry best practice, and the standard implementation for most agents right now, is to do web browsing/fetching via subagents. Their output is summarized by a cheaper model and then passed back to the parent. Unless the actual content the subagents see is preserved, it's very unlikely the `CANARY-` strings would show up in the output.
Any thoughts on how you'd change the test structure with this in mind?
I structured it this way intentionally, because this *is* the finding. Most people are surprised that agents aren't 'seeing' everything that's there, and get frustrated when an agent says something isn't on a page when it clearly is. Raising awareness of this is, to me, one of the main points of the exercise.
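For anyone curious about the mechanics: the idea is that tokens planted at increasing character offsets reveal how deep into a page an agent actually read. A minimal sketch in Python; the offsets and token names here are illustrative, not the live test's:

```python
# Plant unique canary tokens at increasing character offsets, then check
# which ones an agent's output mentions to estimate its read depth.
# Offsets and token names are illustrative assumptions.

CANARIES = {
    10_000: "CANARY-TRUNC-10K-fox",
    40_000: "CANARY-TRUNC-40K-river",
    75_000: "CANARY-TRUNC-75K-summit",
}

def plant(page: str) -> str:
    """Insert each canary at (roughly) its target offset."""
    out, last = [], 0
    for offset, token in sorted(CANARIES.items()):
        out.append(page[last:offset])
        out.append(f"\n{token}\n")
        last = offset
    out.append(page[last:])
    return "".join(out)

def read_depth(agent_output: str) -> int:
    """Deepest offset whose canary the agent echoed back (0 if none)."""
    found = [off for off, tok in CANARIES.items() if tok in agent_output]
    return max(found, default=0)
```

An agent whose output only ever surfaces the 10K canary probably never saw past the first chunk of the page.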
I think it describes, in general terms, how we can picture Claude and OpenAI working, but it glosses over implementation details that are hard to see from their blog posts, e.g. a web search tool vs. a web get tool.
(source: maintained a multi-provider x llama.cpp LLM client for 2.5+ years and counting)
My weighting system there scores the number of pages affected by SPA shells and caps the possible grade at a "D" or "F", depending on the proportion of pages affected: https://afdocs.dev/interaction-diagnostics.html#spa-shells-i...
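For context, the cap works roughly like this; the thresholds below are illustrative assumptions, not the actual rubric's:

```python
# Hypothetical sketch of the grade-cap rule: the larger the share of
# pages served as empty SPA shells, the lower the best achievable grade.
# The 0.5 threshold is an assumption for illustration.

def grade_cap(spa_pages: int, total_pages: int) -> str:
    affected = spa_pages / total_pages if total_pages else 0.0
    if affected == 0:
        return "A"   # no cap applied
    if affected < 0.5:
        return "D"   # some pages unreadable without JS
    return "F"       # most of the site is an SPA shell
```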
I've tried to weight things appropriately in assessing actual sites, but for the test here, I more wanted to just let people see for themselves what types of failures can occur.
Claude Web Opus 4.6 Extended: 14 / 20 points
✗ CANARY-SPA-JSONLY-prism, ✗ CANARY-CONNEG-MD-sigma
> It'd be nice to have a test harness: "Test my agent," to score them and give you a benchmark score (like graphics cards, etc.).
> Agent XYZ: reads only X% of the content it accesses.
I synced up with a colleague of mine who is testing the platform retrieval behaviors across platforms right now, and writing about them at: https://rhyannonjoy.github.io/agent-ecosystem-testing/
The info we have so far isn't consistent enough for a standardized benchmark, but producing something like this is on our radar for the future as we home in on how to assess this more consistently, or at least how to compare outputs in a more standardized way.
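If we do get there, the headline number would probably look something like the quoted "reads only X% of the content" metric. A sketch, under the assumption that planted markers are the measuring stick (marker names made up for illustration):

```python
# Report the share of planted markers that an agent's transcripts
# actually surfaced, as a single percentage. Purely a sketch.

def content_read_pct(markers: list[str], transcripts: list[str]) -> float:
    """Percent of markers that show up anywhere in the agent's output."""
    blob = "\n".join(transcripts)
    hits = sum(1 for m in markers if m in blob)
    return 100.0 * hits / len(markers) if markers else 0.0
```

The hard part isn't the arithmetic, it's making marker placement and question phrasing consistent enough across runs and platforms that two agents' percentages are actually comparable.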