I built a customized deep research tool internally earlier this year that is made up of multiple "agentic" steps, each focusing on specific information to find. The outputs of those steps are always JSON and become the input for the next step. Sure, you can work your way around failures by doing retries, but it's just one less thing to think about if you can guarantee that the random LLM output adheres at least to some sort of structure.
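A minimal sketch of that shape, assuming the openai Python SDK and JSON mode; the model name, prompts, and step structure here are illustrative, not the actual pipeline:

    import json
    from openai import OpenAI

    client = OpenAI()

    def run_step(instructions: str, payload: dict) -> dict:
        # Each step receives the previous step's JSON and must answer in JSON.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": instructions + " Respond in JSON."},
                {"role": "user", "content": json.dumps(payload)},
            ],
        )
        return json.loads(resp.choices[0].message.content)

    sources = run_step("Find sources relevant to the topic.", {"topic": "..."})
    summary = run_step("Summarize the sources you are given.", sources)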
I implemented structured outputs for Claude that way here: https://github.com/simonw/llm-anthropic/blob/500d277e9b4bec6...
Nice to see them support it officially; however, OpenAI has officially supported this for a while but, at least historically, I have been unable to use it because it adds deterministic validation that errors on certain standard JSON Schema elements that we used. The lack of "official" support was actually the feature that pushed us to use Claude in the first place.
It's unclear to me that we will need "modes" for these features.
Another example: I used to think that I couldn't live without Claude Code "plan mode". Then I used Codex and asked it to write a markdown file with a todo list. A bit more typing, but it works well, and it's nice to be able to edit the plan directly in the editor.
Agree or Disagree?
I would hope that this is not what OpenAI/Anthropic do under the hood, because otherwise, what if one of the strings needs a lot of \escapes? Is it also supposed to never write actual newlines in strings? It's awkward.
The ideal solution would be to have some special tokens like [object_start] [object_end] and [string_start] [string_end].
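To make the concern concrete, here is what JSON escaping does to a string that contains quotes and newlines (plain Python, no LLM involved):

    import json

    snippet = 'print("hi")\n# a "quoted" comment\n'
    # Every quote and newline has to be escaped inside the JSON string:
    print(json.dumps({"content": snippet}))
    # {"content": "print(\"hi\")\n# a \"quoted\" comment\n"}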
My favorite one is going through the plan interactively. It turns it into a multiple-choice / option TUI, and the last choice is always to reprompt that section of the plan.
I had to switch back to Codex recently, and not being able to do my planning solely in the CLI feels like the early 1900s.
To trigger the interactive mode, do something like:
Plan a fix for:
<Problem statement>
Please walk me through any options or questions you might have interactively.
I think the new feature limits which tokens can be output, which brings a guarantee, whereas the tools are only a suggestion.
You could get this working very consistently with GPT-4 in mid 2023. The version before June, iirc. No JSON output, no tool calling fine tuning... just half a page of instructions and some string matching code. (Built a little AI code editing tool along these lines.)
With the tool calling RL and structured outputs, I think the main benefit is peace of mind. You know you're going down the happy path, so there's one less thing to worry about.
Reliability is the final frontier!
Structured outputs are the most underappreciated LLM feature. If you're building anything except a chatbot, it's definitely worth familiarizing yourself with them.
They're not that easy to use well, and there aren't many resources on the internet explaining how to get the most out of them.
But also Gemini supports constrained generation, which can't fail to match a schema, so why not use that instead of prompting?
Even asking for JSON (without constrained sampling) sometimes degrades output, and even the names and order of keys can affect performance or act as a form of structured thinking.
At the end of the day current models have enough problems with generalization that they should establish a baseline and move from there.
IMO this was the more elegant design if you think about it: tool calling is really just structured output and structured output is tool calling. The "do not provide multiple ways of doing the same thing" philosophy.
I've found structured output APIs to be a pain across various LLMs. Now I just ask for JSON output and pick it out between the first and last curly brace. If validation fails, I just retry with details about why it was invalid. This works very reliably for complex schemas and works across all LLMs without having to think about limitations.
And then you can add complex pydantic validators (or whatever; I use pydantic) with super helpful error messages to be fed back into the model on retry. Powerful pattern.
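A rough sketch of that pattern with pydantic v2; the schema, the validator, and the call_llm callable are all illustrative, not any particular library's API:

    from typing import Callable
    from pydantic import BaseModel, ValidationError, field_validator

    class Task(BaseModel):
        title: str
        done: bool

        @field_validator("title")
        @classmethod
        def title_not_empty(cls, v: str) -> str:
            if not v.strip():
                raise ValueError("title must not be empty; copy the wording used in the meeting")
            return v

    def extract_json(text: str) -> str:
        # Grab whatever sits between the first '{' and the last '}'.
        return text[text.index("{"): text.rindex("}") + 1]

    def get_task(call_llm: Callable[[str], str], prompt: str, retries: int = 3) -> Task:
        for _ in range(retries):
            raw = call_llm(prompt)  # any provider: just return the raw completion text
            try:
                return Task.model_validate_json(extract_json(raw))
            except (ValueError, ValidationError) as err:
                # Feed the validation error back so the model can fix it on the next try.
                prompt += f"\n\nYour previous reply was invalid:\n{err}\nTry again, JSON only."
        raise RuntimeError("model never produced valid JSON")

The nice part is that the pydantic error message doubles as the repair instruction on retry.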
When you constrain outputs, you're preventing the model from being as verbose in its output, which makes unsafe output much harder to detect, because Claude isn't saying "Excellent idea! Here's how to make a bomb:"
https://github.com/guidance-ai/llguidance
Llguidance implements constrained decoding. That means that for each output token sequence you know which fixed set of tokens is allowed for the next token. You prepare token masks so that at each decoding step you limit which tokens can be sampled.
So if you expect a JSON object, the first token can only be whitespace or the token '{'. This can get more complex because tokenizers usually use byte pair encoding, which means they can represent any UTF-8 sequence. So if your current tokens are '{"enabled": ' and your output JSON schema requires the 'enabled' field to be a boolean, the allowed token mask can only contain whitespace tokens, the tokens 'true' and 'false', or the single-byte BPE tokens 't' and 'f' ('true' and 'false' are usually single tokens because they are so common).
The JSON schema must first be converted into a grammar and then into token masks. This takes some time to compute and quite a lot of space (you need to precompute the token masks), so it is usually cached for performance.
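A toy illustration of the masking idea (pure numpy; this is not llguidance's actual implementation):

    import numpy as np

    vocab = ["{", "}", '"enabled"', ":", " ", "true", "false", "hello"]

    def sample_constrained(logits: np.ndarray, allowed: set) -> int:
        # Disallowed tokens get -inf, so the sampler can only ever pick a
        # token that the grammar permits at this position.
        masked = np.full_like(logits, -np.inf)
        for i in allowed:
            masked[i] = logits[i]
        return int(np.argmax(masked))

    # After '{"enabled": ' the schema requires a boolean, so the mask only
    # allows whitespace, 'true' or 'false'.
    logits = np.random.randn(len(vocab))
    allowed = {vocab.index(" "), vocab.index("true"), vocab.index("false")}
    print(vocab[sample_constrained(logits, allowed)])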
JSON schema is okay so long as it's generated for you, but I'd rather write something human readable and debuggable.
Like, you'd end your prompt like this: 'Provide the response in JSON: {"data":'
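The Anthropic API lets you do roughly the same thing by prefilling the start of the assistant turn; a sketch, with an illustrative model name and prompt:

    import anthropic

    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[
            {"role": "user", "content": "List three colors. Provide the response in JSON."},
            # Prefilled assistant turn: the model continues from this partial JSON.
            {"role": "assistant", "content": '{"data":'},
        ],
    )
    print('{"data":' + message.content[0].text)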
https://docs.claude.com/en/docs/agents-and-tools/tool-use/im...
Unfortunately it doesn't support the full JSON schema. You can't use unions or do other things you would expect. It's manageable, since you can just create another tool for it to choose from that fits the other case.
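For instance, rather than one tool whose input is a union of two shapes, you can register two tools and let the model pick between them (a sketch; the names and schemas are made up):

    tools = [
        {
            "name": "report_success",
            "description": "Use this when the lookup succeeded.",
            "input_schema": {
                "type": "object",
                "properties": {"result": {"type": "string"}},
                "required": ["result"],
            },
        },
        {
            "name": "report_failure",
            "description": "Use this when the lookup failed.",
            "input_schema": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    ]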
This trick has also been in llama.cpp for a couple of years: https://til.simonwillison.net/llms/llama-cpp-python-grammars
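A small sketch with llama-cpp-python's grammar support, assuming a local GGUF model; the GBNF text and model path are illustrative, see the TIL above for the real details:

    from llama_cpp import Llama, LlamaGrammar

    # A tiny GBNF grammar that only accepts {"data": "<string>"}.
    gbnf = r'''
    root   ::= "{\"data\": " string "}"
    string ::= "\"" [^"]* "\""
    '''

    llm = Llama(model_path="./model.gguf")
    grammar = LlamaGrammar.from_string(gbnf)
    out = llm("Provide the response in JSON:", grammar=grammar, max_tokens=64)
    print(out["choices"][0]["text"])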
I would have suspected it too, but I’ve been struggling with OpenAI returning syntactically invalid JSON when provided with a simple pydantic class (a list of strings), which shouldn’t be possible unless they have a glaring error in their grammar.
No, I don't get refusals, I see literally invalid json, like: `{"field": ["value...}`
> 2025-05-20 LLGuidance shipped in OpenAI for JSON Schema
[0] https://platform.openai.com/docs/guides/function-calling#lar... [1] https://github.com/guidance-ai/llguidance
    from openai import OpenAI
    from pydantic import BaseModel

    class FooBar(BaseModel):
        foo: list[str]
        bar: list[int]

    prompt = """# Task
    Your job is to reply with Foo Bar, a json object with foo, a list of strings,
    and bar, a list of ints.
    """

    openai_client = OpenAI()
    response = openai_client.chat.completions.parse(
        model="gpt-5-nano-2025-08-07",
        messages=[{"role": "system", "content": prompt}],
        max_completion_tokens=4096,
        seed=123,
        response_format=FooBar,
        strict=True,
    )
TypeError: Completions.parse() got an unexpected keyword argument 'strict'
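For what it's worth, dropping the stray kwarg seems to be the fix: when parse() is given a pydantic model as response_format, the SDK derives a strict JSON schema from it on its own (a sketch continuing the snippet above; not checked against every SDK version):

    response = openai_client.chat.completions.parse(
        model="gpt-5-nano-2025-08-07",
        messages=[{"role": "system", "content": prompt}],
        max_completion_tokens=4096,
        seed=123,
        response_format=FooBar,  # strictness comes from the schema, not a kwarg
    )
    print(response.choices[0].message.parsed)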
I'd be surprised if they hadn't specifically trained for structured "correct" output for this, in addition to picking the next token following the structure.
It generally happens when the grammar is highly constrained, for example if a boolean is expected next.
If the model assigns a low probability to both true and false coming next, then the sampling strategy will pick whichever one happens to score highest. Most tokens have very similar probabilities close to 0 most of the time, and if you're picking between two of these then the result will often feel random.
It's always the result of a bad prompt, though: if you improve the prompt so that the model understands the task better, there will be a clear difference in the scores the tokens get, and the result seems less random.
Imagine you're asking your model to give you a list of tasks mentioned in a meeting, along with a boolean indicating whether the task is done. If you put the boolean first, the model must decide both what the task is and whether it is done at the same time. If you put the task description first, the model can separate that work into two distinct steps.
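Concretely, the two orderings as pydantic models (a toy example, field names made up):

    from pydantic import BaseModel

    class TaskBooleanFirst(BaseModel):
        # The model has to commit to done/not-done before it has even
        # written out which task it is talking about.
        done: bool
        description: str

    class TaskDescriptionFirst(BaseModel):
        # Writing the description first lets the model settle on the task
        # before deciding whether it was completed.
        description: str
        done: bool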
There are more tricks like this. It's really worth thinking about which calculations you delegate to the model and which you do in code, and how you integrate the two.
A quick look at the llguidance repo doesn't show any signs of Anthropic contributors, but I do see some from OpenAI and ByteDance Seed.
It's a bit weird it took Anthropic so long, considering it's been ages since OpenAI and Google did it. I know you could do it through tool calling, but that always seemed like a bit of a hack to me.
Anthropic seems to be following suit.
(I'm probably just bitter because they owe me $50K+ for stealing my books).