Like low, medium, high, xhigh and so on.
But are they different models underneath? Or same model with different parameter?
The reason I ask is because, if I change the effort param mid conversation in Claude code, I get a warning suggesting I’m breaking the cache.
I don’t think this happens in Codex because when I change the effort, the responses are still quick.
Note that inference libs also have parsers that put hard limits on reasoning tokens with separate counters (similar to how you can put a limit on token generation per completion versus waiting for an <eos>). For that, take a look at vllm reasoning docs.
Usually it’s done in post training to enforce behavior based on prompt. Ie. System prompt with thinking:max or low or wtv.
Enforcement then goes via constrained decoding, checking for think token start and end with max lengths, or other variations