Claude Advanced Tool Use
620 points
1 day ago
| 52 comments
| anthropic.com
theknarf
9 hours ago
[-]
We should just build more CLI tools, that way the agentic AI can just run `yourtool --help` to learn how to use it. Instead of needing an MCP server to access, e.g., Jira, it should just call a CLI tool `jira`. Better CLI tools for everything would help both AI and humans alike.
reply
tomashubelbauer
9 hours ago
[-]
This would be awesome, but great CLIs would already have been valuable before the age of LLMs, and yet most services didn't ship one. I think that's because services like Jira and others do not want to be too open. Ultimately, despite the current LLM/MCP craze, I think this won't change: MCP tools will start getting locked down and nerfed somehow, the same way APIs were in not-so-recent memory, after the craze around those a decade-plus back.
reply
xnorswap
7 hours ago
[-]
I agree with your conclusion that this stuff will get locked down again over time.

I think there's also another major reason people don't like to ship desktop software: the support cost of dealing with outdated tools, which can be immense.

Ticket as raised: "Why is my <Product> broken?"

After several rounds of clarification, it's established they're using a 6-year-old version that's hitting API endpoints that were first deprecated 3 years ago and have finally been removed...

It's incredibly expensive to support multiple versions of a product. On-prem / self-hosted means you have to support several, but at least with web products it's expected they'll phone home, nag to be updated, and that there'll be someone qualified to do that update.

When you add runnable executable tooling, it magnifies the issue of how old that tooling gets.

Even with a support policy of not supporting versions older than <X>, you'll waste a lot of customer support time dealing with issues only for it to emerge that it's outdated software.

reply
pferde
7 hours ago
[-]
If that took "several rounds of clarification", then the support they're paying for is worthless. Getting the version of the application should be among the first bits of information collected, possibly even required for opening the ticket.
reply
xnorswap
4 hours ago
[-]
You've never asked someone for a version and got back a version number for a completely different product?

Obviously it depends on your audience, and 3 rounds is exaggerating the worst case, but in previous places I've worked I've seen customer support requests where the first question that needed to be asked wasn't "What version are you using?" but "Are you sure this is our product you're using?".

Actually getting version info out of that audience would have been at least an email explaining the exact steps, then possibly a follow up phone call to talk them through it and reassure them.

If your reference is JIRA tickets or you're selling software to software developers, then you're dealing with a heavily filtered stream. Ask your CS team for a look at the unfiltered incoming mail, it might be eye-opening if you've not done it before. You might be surprised just how much of their time is spent covering the absolute basics, often to people who have had the same support multiple times before.

reply
bradgessler
7 hours ago
[-]
A big problem with CLI tooling is that it starts off seeming like an easy problem to solve from a dev's perspective: "I'll just write a quick Go or Node app that consumes my web app's API."

Fast forward 12-18 months: after several new features ship and several breaking API changes are made, teams that ship CLIs start to realize it's actually a big undertaking to keep installed CLI software up to date with the API. It turns out there's a lot of auto-updating infrastructure that has to be managed, and even if the team gets that right, it can still be tricky managing which versions get deprecated.

I built Terminalwire (https://terminalwire.com) to solve this problem. It replaces JSON APIs with a smaller API that streams stdio (kind of like ssh), and other commands that control browsers, security, and file access to the client.

It’s so weird to me how each company wants to ship their own CLI and auto-update infrastructure around it. It’s analogous to companies wanting to ship their own browser to consume their own website and deal with all the auto update infrastructure around that. It’s madness.

reply
linsomniac
4 hours ago
[-]
I've had good luck with having Claude write little CLI tools that interact with Jira: "cases" prints out a list of my in-progress cases (including immediately printing a cached list of cases, then querying Jira and showing any stragglers), "other_changes" shows me tickets in this release that are marked with "Other changes" label, "new_release" creates a new release in Jira, our deployment database, and a script to run the release, etc...

I could imagine a subagent that builds a tool on demand when it's needed.

Claude is really good at building small tools like these.
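For flavor, a stripped-down sketch of what a "cases" tool can look like (env var names and the JQL are illustrative, not my actual script - it assumes Jira Cloud's REST search endpoint and an API token):

    #!/usr/bin/env python3
    # Sketch of a "cases" tool: list my in-progress Jira issues.
    import os
    import requests  # pip install requests

    JIRA_URL = os.environ["JIRA_URL"]  # e.g. https://example.atlassian.net
    AUTH = (os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"])

    def cases():
        resp = requests.get(
            f"{JIRA_URL}/rest/api/2/search",
            params={"jql": 'assignee = currentUser() AND status = "In Progress"',
                    "fields": "summary,status"},
            auth=AUTH, timeout=30)
        resp.raise_for_status()
        for issue in resp.json()["issues"]:
            print(issue["key"], issue["fields"]["summary"])

    if __name__ == "__main__":
        cases()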

reply
ath92
9 hours ago
[-]
Nobody shipped this because previously almost nobody could use CLI tools. Now you can just ask an LLM to generate the commands, which makes things much more accessible.
reply
gjvc
7 hours ago
[-]
"almost nobody"
reply
joshribakoff
4 hours ago
[-]
The good news is that an LLM will be really helpful in scraping your content, locating alternative service providers, or even creating your own solution so you can migrate away.
reply
throwaway19268
5 hours ago
[-]
CLI tools for online services like Jira basically amount to an open and documented API, and as you mention, the attitude towards those is unlikely to change anytime soon.
reply
tacone
29 minutes ago
[-]
And CLI tools can be composed with scripts, which makes the whole experience faster and reproducible.
reply
3acctforcom
40 minutes ago
[-]
JIRA is a great example. I used to have automation when it was hosted on prem and I had database access.

Now it's locked into the cloud with piss poor APIs so they can sell you more add-ons. I'm actively looking at alternatives.

reply
chillfox
4 hours ago
[-]
That's pretty much how I have been using coding agents. I get them to build small cli tools with a --help option and place them in a `./tools` directory. Then I can tell an agent to use the tools to accomplish whatever task I need done.

Usually, the first time I mention a tool in a prompt I will put it in backticks with the --help argument, e.g. `tool --help`.

It works really well.
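A minimal example of the shape (the tool itself is invented; argparse is what gives the agent a useful --help for free):

    #!/usr/bin/env python3
    # ./tools/deploy_counts - hypothetical example of a small agent tool.
    # argparse generates the --help screen the agent reads to learn usage.
    import argparse
    import csv

    def main():
        parser = argparse.ArgumentParser(
            description="Count deploy events per environment from a CSV log.")
        parser.add_argument("csv_path", help="path to the deploy-events CSV")
        parser.add_argument("--env", help="only count this environment")
        args = parser.parse_args()

        counts = {}
        with open(args.csv_path, newline="") as f:
            for row in csv.DictReader(f):
                env = row["environment"]
                if args.env and env != args.env:
                    continue
                counts[env] = counts.get(env, 0) + 1
        for env, n in sorted(counts.items()):
            print(f"{env}: {n}")

    if __name__ == "__main__":
        main()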

reply
misiti3780
4 hours ago
[-]
I'm going to try this, it sounds promising. Can you provide an example for more context?
reply
easyascake
6 hours ago
[-]
I use the GitLab CLI (glab) extensively, because it is so much better than the (official) GitLab MCP. I just run `glab auth login` before launching Claude Code, then tell CC to use `glab` to communicate with the GitLab API.

When using the MCP, I have to do a whole OAuth browser-launch process, and even then I'm limited to the 9-10 tools they've shipped it with so far.

reply
PantaloonFlames
2 hours ago
[-]
See also “95% of MCP Servers are useless”. https://youtu.be/7baGJ1bC9zE?si=ShyLg2mHWwbBW1DS

tl;dr AI-powered assistants can already use command line tools.

reply
delaminator
9 hours ago
[-]
That's only useful if the agent is running in your terminal. The example given about updating a cell in Excel - I suppose that is a sort of tool you could use for something. SharePoint has an API for updating Excel files on SharePoint. But updating a single cell is actually quite time-consuming given the API round trip - multiple seconds. I recently had to rewrite something because it was doing individual API calls to update cells.
reply
makestuff
5 hours ago
[-]
Yeah, I have a feeling we will instead start exposing some /help API that the AI will first call to see all possible operations and how to use them, in some sort of minified documentation format.
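Something like this, sketched with Flask (the endpoint name and doc format are just a guess at what it could look like):

    # Hypothetical /help endpoint: a minified catalog of operations
    # that an agent fetches once before planning its calls.
    from flask import Flask, jsonify  # pip install flask

    app = Flask(__name__)

    OPERATIONS = {
        "list_invoices":  {"method": "GET",  "path": "/invoices",
                           "params": ["status", "since"]},
        "create_invoice": {"method": "POST", "path": "/invoices",
                           "body": ["customer_id", "amount_cents"]},
    }

    @app.get("/help")
    def help_index():
        return jsonify(OPERATIONS)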
reply
zby
6 hours ago
[-]
I think this is about tools that are executed on the server instead of on the client. This is all very confusing - so I might be mistaken.
reply
stpedgwdgfhgdd
7 hours ago
[-]
All for CLI tools, but they have their limits. Another party can update their MCP server and you get the new tools without breaking a sweat.

E.g. with a Jira CLI tool I have to write the skill and keep it up to date. With an MCP server I can delegate most of the work.

reply
mlrtime
8 hours ago
[-]
This is exactly what I do when given a PAT + API documentation: write my own tool. Sure, it'd be better if Atlassian did it, but I'm not holding my breath.
reply
PantaloonFlames
2 hours ago
[-]
Or, why not let the LLM write the tool and give it to the agent? Taking it one step further, the tool could be completely ephemeral - it could have a lifetime of exactly one chat conversation.
reply
bradgessler
7 hours ago
[-]
For those running production Rails apps that want to ship a CLI for AI & Human integration, I built https://terminalwire.com. There’s a few folks running it in production, one being a payroll company that’s playing around with it for AI integration. I love that if somebody wanted to, they could run payroll over a CLI.
reply
MrDarcy
6 hours ago
[-]
Stop spamming.
reply
jmward01
23 hours ago
[-]
Programmatic Tool Calling has been an obvious next step for a while. It is clear we are heading towards code as a language for LLMs, so defining that language is very important. But I'm not convinced of tool search. Good context engineering leaves only the tools you will need in context, so adding a search when you are going to use all of them anyway is just more overhead. What is needed is a more compact tool definition language - like, I don't know, every programming language ever in how they define functions. We also need objects (which hopefully Programmatic Tool Calling solves, or the next version will solve). In the end I want to drop objects into context with exposed methods, with the model knowing the type and what is callable on that type.
reply
fny
21 hours ago
[-]
Why exactly do we need a new language? The agents I write get access to a subset of the Python SDK (i.e. non-destructive), packages, and custom functions. All this ceremony around tools and pseudo-RPC seems pointless given LLMs are extremely capable of assembling code by themselves.
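The restricted-SDK part is less magic than it sounds - a minimal sketch (my setup differs in details, and exec with stripped builtins alone is not a real security boundary):

    import math

    # The only names LLM-generated code gets to touch; read-only by design.
    ALLOWED = {
        "len": len, "sum": sum, "min": min, "max": max, "sorted": sorted,
        "math": math,
        "fetch_orders": lambda customer_id: [],  # stub for a read-only API call
    }

    def run_agent_code(source: str, inputs: dict):
        env = {"__builtins__": {}}  # no open(), no __import__
        env.update(ALLOWED)
        env.update(inputs)
        exec(source, env)           # still needs real isolation underneath
        return env.get("result")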
reply
delaminator
9 hours ago
[-]
I'm imagining something more like Rexx with quite high level commands. But that certainly blurs the line between programming language and shell.

The reason for choosing higher-level constructs is token use. We certainly reduce the number of tokens by using a shell-like command language, but of course that also reduces expressiveness.

I've been meaning to get round to a Plan 9 style where the LLM reads and writes from files rather than running commands. I'm not sure whether that's going to be more useful than just running commands. It might be for an end user, because they only have to think about one paradigm: reading/writing files.

reply
never_inline
12 hours ago
[-]
Does this "non destructive subset of python SDK" exist today, without needing to bring, say, a whole webassembly runtime?

I am hoping something like CEL (with verifiable runtime guarantees) but the syntax being a subset of Python.

reply
FridgeSeal
17 hours ago
[-]
Woah woah woah, you’re ignoring a whole revenue stream caused by deliberately complicating the ecosystem, and then selling tools and consulting to “make it simpler”!

Think of all the new yachts our mega-rich tech-bros could have by doing this!

reply
zeroq
17 hours ago
[-]
my VS fork brings all the boys to the yard and they're like it's better than yours, damn right, it's better than yours
reply
rekttrader
13 hours ago
[-]
I can teach you, but I’ll have to charge
reply
checker659
16 hours ago
[-]
This is the most creative comment I've read on HN as of late.
reply
DANmode
13 hours ago
[-]
…don’t read many comments?
reply
CSSer
11 hours ago
[-]
A social lesson: don't yuck other people's yum.
reply
zeroq
15 hours ago
[-]
<3

Thanks, most of the time when I do that people tell me to stop being silly and stop saying nonsense.

¯\_(ツ)_/¯

reply
dalemhurley
11 hours ago
[-]
Tool search is formalising what a lot of teams have been working towards. I had previously called it a "tool caller": the LLM knew there were tools for domains, and when a domain was mentioned, the tools for that domain would be loaded. This looks a bit smarter.
reply
mirekrusin
20 hours ago
[-]
Exactly, instead of this mess, you could just give it something like .d.ts.

Easy to maintain, test etc. - like any other library/code.

You want structure? Just export * as Foo from '@foo/foo' and let it read .d.ts for '@foo/foo' if it needs to.

But wait, it's also good at writing code. Give it write access to it then.

Now it can talk to a SQL server, gRPC, GraphQL, REST, JSON-RPC over WebSocket, or whatever - even your USB.

If it needs some tool, it can import or write it itself.

Next realisation may be that a jupyter/pluto/mathematica/observable-style, but more book-like, ai<->human interaction platform works best for communication itself (too much raw text; it'd take you days to comprehend what it spat out in 5 minutes - better to have summary pictures, interactive charts, whatever).

With voice-to-text because poking at flat squares in all of this feels primitive.

For improved performance you can peer it with other sessions (within your team, or global/public) - surely others have solved problems similar to yours, and you can grab ready solutions.

It already has the ability to create a tool that copies itself and can talk to the copy, so it's fair to call this system "skynet".

reply
cjmcqueen
8 hours ago
[-]
Skynet is exactly where I thought this was heading...
reply
menix
22 hours ago
[-]
The latest MCP specifications (2025-06-18+) introduced crucial enhancements like support for Structured Content and the Output Schema.

Smolagents makes use of this and handles tool output as objects (e.g. dict). Is this what you are thinking about?

Details in a blog post here: https://huggingface.co/blog/llchahn/ai-agents-output-schema

reply
jmward01
21 hours ago
[-]
We just need simple language syntax like Python, and for models to be trained on it (which they already mostly are):

    class MyClass(SomeOtherClass):
        def my_func(a: str, b: int) -> int:
            # Put the description (if needed) in the body for the LLM.
That is way more compact than the JSON schema out there. Then you can have 'available objects' listed like o1 (MyClass), o2 (SomeOtherClass) as the starting context. Combine this with programmatic tool calling and there you go. Much, much more compact. It binds well to actual code and is very flexible. This is the obvious direction things are going; I just wish Anthropic and OpenAI would realize it and define it / train models on it sooner rather than later.

edit: I should also add that inline response should be part of this too: the model should be able to emit ```<code here>``` and keep executing, with only blocking calls requiring it to stop generating until the block frees up. So, for instance, the model could run ```r = start_task(some task)```, generate other things, then ```print(r.value())``` (probably with various awaits and the like here, but you all get the point).
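In plain Python the non-blocking half of this already has a shape - a sketch, with r.result() standing in for the r.value() above:

    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor()

    def start_task(fn, *args):
        return pool.submit(fn, *args)  # returns a future immediately

    r = start_task(lambda: 41 + 1)  # model keeps generating here...
    print(r.result())               # ...and blocks only when it needs the value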

reply
schmuhblaster
7 hours ago
[-]
I've been experimenting with giving the LLM a Prolog-based DSL, used in a CodeAct style pattern similar to Huggingface's smolagents. The DSL can be used to orchestrate several tools (MCP or built in) and LLM prompts. It's still very experimental, but a lot of fun to work with. See here: https://github.com/deepclause/deepclause-desktop.
reply
ctoth
18 hours ago
[-]
I'm not sure that we need a new language so much as just primitives from AI gamedev, like behavior trees along with the core agentic loop.
reply
sandbags
10 hours ago
[-]
After implementing a behaviour tree library and realising the power of select & sequence I found myself wondering why they aren’t used more widely.

I’ve never done anything in crypto but watched in horror as people created immutable contracts with essentially Javascript programs. Surely it would be much easier to reason about/verify scripts written as a behaviour tree with a library of queries and actions. Even being able to limit the scope of modifications would be a win.

reply
delaminator
8 hours ago
[-]
Seeing as it was your inspiration, here is a summary of a discussion with Claude on this topic. (not the crypto part.)

https://claude.ai/public/artifacts/2b23f156-c9b5-42df-9a83-f...

reply
vendiddy
10 hours ago
[-]
Giving the AI an actual programming language (functions + objects) genuinely does seem like a good alternative to the MCP mess we have right now.
reply
stingraycharles
16 hours ago
[-]
Reminds me a bit of the problem that GraphQL solves for the frontend, which avoids a lot of round-trips between client and server and enables more processing to be done on the server before returning the result.
reply
politelemon
15 hours ago
[-]
And introduce a new set of problems in doing so.
reply
malnourish
7 hours ago
[-]
Complexity doesn't go away, it just moves somewhere else.
reply
knowsuchagency
18 hours ago
[-]
I completely agree. I wrote an implementation of this exact idea a couple weeks ago https://github.com/Orange-County-AI/MCP-DSL
reply
user3939382
16 hours ago
[-]
Adding extra layers of abstraction on top of tools we don’t even understand is a sickness.
reply
jawns
21 hours ago
[-]
I'm starting to notice a pattern with these AI assistants.

Scenario: I realize that the recommended way to do something with the available tools is inefficient, so I implement it myself in a much more efficient way.

Then, 2-3 months later, new tools come out to make all my work moot.

I guess it's the price of living on the cutting edge.

reply
lukan
12 hours ago
[-]
The frustrating part is that with all the hype it is hard to see what actually works right now. I refused to live on the edge like you and just occasionally used ChatGPT for specific tasks. But I do like the idea of AI assistants for old codebases, so I just gave the modern ways a shot again, and it still seems messy: I never know if I'm simply not doing it right, or if there is no right way and things sometimes work and sometimes don't. I guess I'll wait some more before investing in building tools that will be obsolete in weeks or months.
reply
mlrtime
8 hours ago
[-]
This is the cost of the bleeding edge... in our internal company AI Slack channel, people ask every week what the best method to do something is.

The answer is always something like: "As of today, do a,b,c. But this will be different next week/month".

I like it; we are at the forefront of this technology, and years from now we will be telling kids stories about how it used to be.

reply
1dom
6 hours ago
[-]
I think the stories told about this time in particular will be the same as the stories told about any boom/bust cycle: a frenzied feeling of progress which resulted in a tiny handful of people getting outrageously wealthy, whilst the vast majority of people and society as a whole loses a whole lot of time, money and dignity.
reply
ACCount37
9 hours ago
[-]
The consequences of having the world's smartest people working on those things 24/7.

Often, either the model itself gets improvements that render past scaffolding redundant, or your clever hacks to squeeze more performance out get obsoleted by official features that do the same thing better.

reply
1dom
6 hours ago
[-]
I think this is specifically the consequence of smart people working in a bubble: there's no clearly defined problem being solved, and there's no common solution everyone's aiming for, there's just a general feeling of a direction ("AI") along with a pressure to get there before anyone else.

It leads to a false feeling of progress, because everyone thinks they're busy working at the forefront, when in reality only a tiny handful of people are actually innovating.

Everyone else (including me and the person you responded to) is just wasting time relearning new solutions every week to "the problem with current AI" .

It's tiring reading daily/weekly "Advanced new solution to that problem we said was the advanced new solution last month", especially when that solution is almost always a synonym of "prompt engineering", "software engineering" or "prompt engineering with software engineering".

reply
hobofan
6 hours ago
[-]
> It's tiring reading daily/weekly "Advanced new solution to that problem we said was the advanced new solution last month"

At least for the current iterations that come to mind here, every advanced new solution solves the problem for a subset of problems, and the advanced new solution after that solves it for a subset of the remaining problems.

E.g. if you are tool calling with a fixed set of 10 tools you don't _need_ anything outlined in this blog post (though, you may use it as token count optimization).

It's just the same as in other programming disciplines. Nobody is forcing you to stay up to date with frontend framework trends if you have a minimally interactive frontend where a <form> element already solves your problem. Similarly, nobody forces you to stay up to date with AI trends on a daily basis. There are still plenty of product problems ready to be exploited that do well enough with the state of AI and dumb prompt engineering from a year ago.

reply
doctorpangloss
2 hours ago
[-]
> The consequences of having the world's smartest people working on those things 24/7.

haha, don't you worry, they are going to be back to working on ads - inside the chatbots - soon enough

reply
pjm331
6 hours ago
[-]
Hah I’m only on the cutting edge part time on the side so my experience has been more like - start thinking about the problem and then 2 or 3 days later new tools come out that solve it for me
reply
swapnilt
6 hours ago
[-]
The 'tool use' framing is interesting but feels like a rebranding of what's essentially sophisticated prompt engineering with structured outputs. The real limitation isn't whether Claude can 'use' tools—it's the latency and token overhead. Has anyone benchmarked whether these tool calls are actually faster/cheaper than fine-tuning smaller models with deterministic output schemas? Curious if the 'advanced' framing here is product differentiation or genuine architectural improvement.
reply
losvedir
21 hours ago
[-]
I never really understood why you have to stuff all the tools in the context. Is there something wrong with having all your tools in, say, a markdown file, and having a subagent read it with a description of the problem at hand and returning just the tool needed at that moment? Is that what this tool search is?
reply
jimbo808
15 hours ago
[-]
Claude is pretty good at totally disregarding most of what's in your CLAUDE.md, so I'm not optimistic. For example, a project I work on gives it specific scripts to run when it runs automated tests, because the project is set up in a way that requires some special things to happen before tests will work correctly. I've never once seen it actually call those scripts on the first try. It always tries to run them using the typical command that doesn't work with our setup, and I have to remind it what the correct thing to run is.
reply
losvedir
5 hours ago
[-]
That's kind of the opposite of what I mean. CLAUDE.md is (ostensibly) always loaded into the context window so it affects everything the model does.

I'm suggesting a POTENTIAL_TOOLS.md file that is not loaded into the context, but which Claude knows the existence of. That file would be an exhaustive list of all the tools you use, but which would be too many tokens to have perpetually in the context.

Finally, Claude would know - while it's planning - to invoke a sub-agent to read that file with a high level idea of what it wants to do, and let the sub-agent identify the subset of relevant tools and return those to the main agent. Since it was the sub-agent that evaluated the huge file, the main agent would only have the handful of relevant tools in its context.

reply
snek_case
14 hours ago
[-]
I've had a similar experience with Gemini ignoring things I've explicitly told it (sometimes more than once). It's probably context rot. LLMs give you a huge advertised number of tokens in the context, but the more stuff you put in there, the less reliably it remembers everything, which makes sense given how transformer attention blocks work internally.
reply
cerved
13 hours ago
[-]
Claude is pretty good at forgetting to run Maven with the -am flag, to write bash heredocs that its interpreter doesn't weird out on, or to use the != operator in jq. Maybe Claude has early-onset dementia.
reply
vendiddy
10 hours ago
[-]
Demented AIs running amok is just what we need in this day and age.
reply
notpublic
9 hours ago
[-]
Instead of including all these instructions in CLAUDE.md, have you considered using custom Skills? I’ve implemented something similar, and Skills works really well. The only downside is that it may consume more tokens.
reply
stpedgwdgfhgdd
7 hours ago
[-]
The matching logic for a skill is pretty strict. I wonder whether mentioning ‘git’ in the front matter and using ‘gitlab’ would give a match for a skill to get triggered.
reply
taytus
8 hours ago
[-]
Yes, sometimes skills are more reliable, but not always. That is the biggest culprit to me so far. The fact that you cannot reliably trust these LLMs to follow steps or instructions makes them unsuitable for my applications.
reply
notpublic
7 hours ago
[-]
Another thing that helps is adding a session hook that triggers on startup|resume|clear|compact to remind Claude about your custom skills. Keeps things consistent, especially when you're using it for a long time without clearing context
reply
nautilus12
7 hours ago
[-]
I had the same problem. My CLAUDE.md eventually gets forgotten, and it forgets the best practices I put in there. I've switched to using hooks that run it through a variety of checks, like requiring tests. That seems to work better than CLAUDE.md, because it has to run the hook every time it makes changes.
reply
falcor84
18 hours ago
[-]
That's exactly what Claude Skills do [0], and while this tool search appears to be distinct, I do think that they're on the way to integrating MCP and Skills.

[0] https://code.claude.com/docs/en/skills

reply
esperent
15 hours ago
[-]
I haven't had much luck with skills being called appropriately. When I have a skill called "X doer", and then I write a prompt like "Open <file> and do X", it almost never loads up the skill. I have to rewrite the prompt as "Open <file> and do X using the X doer skill".

Which is basically exactly as much effort as what I was doing previously of having prewritten sub-prompts/agents in files and loading up the file each time I want to use it.

I don't think this is an issue with how I'm writing skills, because it includes skill like the Skill Creator from Anthropic.

reply
notpublic
7 hours ago
[-]
Try adding a session hook that triggers on startup|resume|clear|compact to remind Claude about your custom skills.
reply
esperent
5 hours ago
[-]
Is there a session start hook? I don't think so, unless it was added recently.

I've mostly been working on smaller projects so I never need to compact. And skills are definitely not working even on the initial prompt of a new session.

reply
slhck
12 hours ago
[-]
Same experience here – it seems I have to specifically tell it to use the "X skill" to trigger it reliably. I guess with all the different rules set up for Claude to follow, it needs that particular word to draw its attention to the required skill.
reply
_joel
8 hours ago
[-]
Ditto, I also find it'll invariably decide to disregard the CLAUDE.md again and produce a load of crap I didn't really ask it for.
reply
JyB
18 hours ago
[-]
That’s exactly what it is in essence. The MCP protocol simply doesn’t have any mechanism specifications (yet) for not loading tools completely in the context. There’s nothing really strange about it. It’s just a protocol update issue.
reply
noodletheworld
13 hours ago
[-]
> I never really understood why you have to stuff all the tools in the context.

You probably don't for... like, trivial cases?

...but tool use is usually the most fine-grained point in an agent's step-by-step implementation plan; so when planning, if you don't know what tool definitions exist, an agent might end up solving a problem naively step by step using primitive operations, when a single tool already exists that does that, or does part of it.

Like, it's not quite as simple as "Hey, do X"

It's more like: "Hey, make a plan to do X. When you're planning, first fetch a big list of the tools that seem vaguely related to the task and make a step-by-step plan keeping in mind the tools available to you"

...and then, for each step in the plan, you can do a tool search to find the best tool for x, then invoke it.

Without a top level context of the tools, or tool categories, I think you'll end up in some dead-ends with agents trying to use very low level tools to do high level tasks and just spinning.

The higher level your tool definitions are, the worse the problem is.

I've found this is the case even now with MCP, where sometimes you have to explicitly tell an agent to use particular tools, not to try to re-invent stuff or use bash commands.

reply
behnamoh
23 hours ago
[-]
I cannot believe all these months and years people have been loading all of the tool JSON schemas upfront. This is such a waste of context window and something that was already solved three years ago.
reply
michaelanckaert
23 hours ago
[-]
^ this. Careful design of what tools are passed when is key to good agent design.
reply
qntty
22 hours ago
[-]
Solved how?
reply
orliesaurus
11 hours ago
[-]
Function calling is back
reply
artursapek
18 hours ago
[-]
What is the right pattern? Do you just send a list of tool names & descriptions, and just give the agent an "install" tool that adds a given tool to the schema on the next turn?
reply
cube2222
23 hours ago
[-]
Nice! Feature #2 here is basically an implementation of the “write code to call tools instead of calling them directly” that was a big topic of conversation recently.

It uses their Python sandbox, is available via API, and exposes the tool calls themselves as normal tool calls to the API client - should be really simple to use!

Batch tool calling has been a game-changer for the AI assistant we've built into our product recently, and this sounds like a further evolution of it, really (primarily, it's about speed; if you can accomplish 2x more tool calls in one turn, it will usually mean your agent is now 2x faster).
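For concreteness, the batch part is mostly just concurrency - a rough sketch assuming async handlers and Anthropic-style tool_use/tool_result blocks:

    import asyncio

    async def execute_batch(tool_calls, handlers):
        # tool_calls: the tool_use blocks from one assistant turn
        # handlers: tool name -> async function
        results = await asyncio.gather(
            *(handlers[c["name"]](**c["input"]) for c in tool_calls))
        return [{"type": "tool_result", "tool_use_id": c["id"], "content": r}
                for c, r in zip(tool_calls, results)]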

reply
polyrenn
15 hours ago
[-]
The "write code to call tools instead of calling them directly" has been such an obvious path, the team at Huggingface & smolagents figured that out a while ago, agents that write code instead of natural language are just better for most cases.
reply
olliem36
8 hours ago
[-]
Sounds good for tasks like the excel example in the article, but I wonder how this approach will hold up in other multi-step agentic flows. Let me explain:

I try to be defensive in agent architectures to make it easy for AI models to recover/fix workflows if something unexpected happens.

If something goes wrong halfway through the code execution of multiple 'tools' using Programmatic Tool Calling, it's significantly more complex for the AI model to fix that code and try again compared to a single tool usage - you're in trouble, especially if APIs/tools are not idempotent.

The sweet spot might be using this as a strategy to complete tasks that are idempotent/retryable (like a database transaction) if they fail halfway through execution.

reply
ra
20 hours ago
[-]
This is heading in the wrong direction.

> The future of AI agents is one where models work seamlessly across hundreds or thousands of tools.

Says who? I see it going the other way - less tools, better skills to apply those tools.

To take it to an extreme, you could get by with ShellTool.

reply
post_below
10 hours ago
[-]
The problem with this, unless I'm misunderstanding what you're saying, is that the model's responses go into the context. So if it has to reinvent the wheel every session by writing bash scripts (or similar) you're clogging up the context and lowering the quality/focus of the session while also making it more expensive. When you could instead be offloading to a tool whose code never comes into the context, the model only has to handle the tool's output rather than its codebase.

I do agree that better tools, rather than more tools, is the way to go. But any situation where the model has to write its own tools is unlikely to be better.

reply
dragonwriter
18 hours ago
[-]
Using shell as an intermediary is the same kind of indirection as tool search and tool use from code, so I think you are largely agreeing with their substantive sentiment while disagreeing with their word choice.
reply
ra
18 hours ago
[-]
Not exactly. The proliferation of tools built into agents for computer use is antithetical, given that computer use is a key focus for model development.

Why build a tonne of tool-use infra when you could simplify instead?

reply
Culonavirus
13 hours ago
[-]
> less tools, better skills to apply those tools

All models have peaked (the velocity of progress is basically zero compared to previous years) - there are not going to be "better skills" (any time soon).

All these bubbled up corps(es) have to try to sell what they can, agent this, tool that, buzzword soup to keep the investors clueless one more year.

reply
jbs789
11 hours ago
[-]
That’s one narrative.

It’s typical for the foundation to settle before building on top of it.

Additionally, I do agree there is immense commercial pressure.

I’m quite curious how it all shakes out across the majors. If the foundation is relatively similar then the differentiator (and what they can charge) will determine their returns on this investment.

As a user, I love the competition and the evolution.

As an investor, am curious how it shakes out.

reply
Libidinalecon
8 hours ago
[-]
I just don't know how you can spend any time with Gemini 3 and say things have peaked and progress is zero.

Just totally absurd.

It is really the opposite, the models are getting so good I question why I am wasting my time reading stupid comments like this from people.

reply
causal
19 hours ago
[-]
Yeah, I kind of agree. I think there's demand for a connector ecosystem because it's something we can understand and market, but I think it's the wrong paradigm.
reply
jasonthorsness
20 hours ago
[-]
While maybe the model could do everything from first principles every time, once you have a known good tool that performs a single action perfectly, why not use that tool for that action? Maybe as part of training, the model could write, test, and learn to trust its own set of tools, rather than rely on humans to set them up afterwards.
reply
mewpmewp2
19 hours ago
[-]
In this case the LLM would have to write a bunch of stuff from scratch, though, and might call APIs wrongly.
reply
lewisjoe
4 hours ago
[-]
The criticisms here surprise me. "Programmatic Tool Calling" is a huge leap when you want AI to work with your app - like a human would.

I've been trying to get LLMs to work in our word processor documents like a human collaborator following instructions. Writing a coding agent is far more straightforward (all code is just plain strings) than getting an agent to work with rich text documents.

I imagined the only sane way is to expose a document SDK and expect AI to write programs that call those SDK APIs. That was the only way to avoid MCPs and context explosion. Claude has now made this possible and it's exciting!

Hope the other AI folks adopt this as well.

reply
michaelanckaert
23 hours ago
[-]
The "Tool Search Tool" is like a clever addition that could easily be added yourself to other models / providers. I did something similar with a couple of agents I wrote.

First LLM Call: only pass the "search tool" tool. The output of that tool is a list of suitable tools the LLM searched for. Second LLM Call: pass the additional tools that were returned by the "search tool" tool.
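Roughly like this (a sketch with the Anthropic Python SDK; the model name and the naive ranking function are placeholders):

    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()

    SEARCH_TOOL = {
        "name": "search_tools",
        "description": "Find tools relevant to the task. Returns tool names.",
        "input_schema": {"type": "object",
                         "properties": {"query": {"type": "string"}},
                         "required": ["query"]}}

    def search_registry(registry, query):
        q = query.lower()  # naive ranking; swap in embeddings/BM25 as needed
        return [n for n, t in registry.items() if q in t["description"].lower()]

    def run(task, registry):
        # Call 1: only the search tool is in context.
        first = client.messages.create(
            model="claude-sonnet-4-5", max_tokens=1024,
            tools=[SEARCH_TOOL],
            messages=[{"role": "user", "content": task}])
        names = []
        for block in first.content:
            if block.type == "tool_use" and block.name == "search_tools":
                names = search_registry(registry, block.input["query"])
        # Call 2: pass only the tools the search returned.
        return client.messages.create(
            model="claude-sonnet-4-5", max_tokens=1024,
            tools=[registry[n] for n in names],
            messages=[{"role": "user", "content": task}])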

reply
stavros
22 hours ago
[-]
When reading the article, I thought this would be an LLM call, ie the main agent would call `find_tool("I need something that can create GitHub PRs")`, and then a subagent with all the MCP tools loaded in its context would return the names of the suitable ones.

I guess regex/full text search works too, but the LLM would be much less sensitive to keywords.

reply
RobertDeNiro
22 hours ago
[-]
Since it's a tool itself, I don't see the benefit of relying on Anthropic for this. If anything, it now becomes vendor lock-in.
reply
michaelanckaert
22 hours ago
[-]
Correct, I wouldn't use it myself as it's a trivial addition to your implementation. Personally I keep all my work in this space as provider agnostic as I can. When the bubble eventually pops there will be victims, and you don't want a stack that's hard coded to one of the casualties.
reply
BoorishBears
22 hours ago
[-]
They can post-train the model on usage of their specific tool along with the specific prompt they're using.

LLMs generalize obviously, but I also wouldn't be shocked if it performs better than a "normal" implementation.

reply
fofoz
22 hours ago
[-]
It’s quite obvious that at some point the entire web will become a collection of billions of tools; Google will index them all, and Gemini will dynamically select them to perform actions in the world for you. Honestly, I expected this with Gemini 3
reply
mewpmewp2
19 hours ago
[-]
I thought for a while there will be this massive standardized schema connecting all World APIs into a single traversable object. Allowing you to easily connect anything.
reply
htrp
14 hours ago
[-]
OG web3 - the semantic web
reply
rfw300
23 hours ago
[-]
I am extremely excited to use programmatic tool use. This has, to date, been the most frustrating aspect of MCP-style tools for me: if some analysis requires the LLM to first fetch data and then write code to analyze it, the LLM is forced to manually copy a representation of the data into its interpreter.

Programmatic tool use feels like the way it always should have worked, and where agents seem to be going more broadly: acting within sandboxed VMs with a mix of custom code and programmatic interfaces to external services. This is a clear improvement over the LangChain-style Rube Goldberg machines that we dealt with last year.

reply
menix
22 hours ago
[-]
smolagents by Hugging Face tackles your issues with MCP tools. They added support for the output schema and structured output provided by the latest MCP spec. This way print and inspect is no longer necessary. https://huggingface.co/blog/llchahn/ai-agents-output-schema
reply
_pdp_
23 hours ago
[-]
Our agentic builder has a single tool.

It is called graphql.

The agent writes a query and executes it. If the agent does not know how to do a particular type of query, it can use GraphQL introspection. The agent only receives the minimal amount of data specified by the GraphQL query, saving valuable tokens.

It works better!

Not only do we not need to load 50+ tools (our entire SDK), it also solves the N+1 problem of traditional REST APIs. You also don't need to fall back to writing code, especially for queries and mutations. But if you do need that, the SDK is always available, following the typed GraphQL schema - which helps agents write better code!

While I was never a big fan of graphql before, considering the state of MCP, I strongly believe it is one of the best technologies for AI agents.

I wrote more about this here if you are interested: https://chatbotkit.com/reflections/why-graphql-beats-mcp-for...
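For reference, the entire tool surface is tiny - a sketch (endpoint, auth, and wording are placeholders, not our actual code):

    import requests

    # The single tool definition the agent sees.
    GRAPHQL_TOOL = {
        "name": "graphql",
        "description": ("Execute a GraphQL query or mutation against the app "
                        "API. Use introspection to discover the schema."),
        "input_schema": {"type": "object",
                         "properties": {"query": {"type": "string"},
                                        "variables": {"type": "object"}},
                         "required": ["query"]}}

    def execute_graphql(endpoint, token, query, variables=None):
        resp = requests.post(
            endpoint,
            json={"query": query, "variables": variables or {}},
            headers={"Authorization": f"Bearer {token}"},
            timeout=30)
        resp.raise_for_status()
        return resp.json()  # only the fields the query asked for come back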

reply
fy20
14 hours ago
[-]
Whoa there, you don't need to be so sadistic to your team. It's not GraphQL, but having a document describing how your API works, including types, that is important.

I expect you could achieve the same with a comprehensive OpenAPI specification. If you want something a bit stricter I guess SOAP would work too, LLMs love XML after all.

reply
_pdp_
9 hours ago
[-]
We have well described OpenAPI and GraphQL specifications already. :)

Being AI-first means we are naturally aligned with that kind of structured documentation. It helps both humans and robots.

reply
geoffhill
19 hours ago
[-]
One of my agents is kinda like this too. The only operation is SPARQL query, and the only accessible state is the graph database.

Since most of the ontologies I'm using are public, I just have to namedrop them in prompt; no schemas and little structure introspection needed. At worst, it can just walk and dump triples to figure out structure; it's all RDF triples and URIs.

One nice property: using structured outputs, you can constrain outputs of certain queries to only generate valid RDF to avoid syntax errors. Probably can do similar stuff with GraphQL.

reply
bravura
20 hours ago
[-]
Isn't the challenge that introspecting graphql will lead to either a) a very long set of definitions consuming many tokens or b) many calls to drill into the introspection?
reply
_pdp_
9 hours ago
[-]
Well, either that or stuff tool usage examples into the prompt for every single request. If you have only 2-3 tools, GraphQL is certainly not necessary - but it won't blow up the context either. If you have 50+ tools, I don't see any other way, to be honest, unless you create your own tool discovery solution - which is what GraphQL does really well, with the caveat that whatever custom thing you build is not natural to these LLMs.

Keep in mind that all LLMs are trained on many GraphQL examples, because the technology has been around since 2015. While anything custom might just work, it is certainly not part of the model's training set unless you fine-tune.

So yes, if I need to decide on formats I will go for GraphQL, SQL and Markdown.

reply
peacebeard
20 hours ago
[-]
In my experience, this was the limitation we ran into with this approach. If you have a large API this will blow up your context.

I have had the best luck with hand-crafted tools that pre-digest your API so you don't have to waste tokens or deal with context rot bugs.

reply
ramnivasl
21 hours ago
[-]
That is also the approach we took with Exograph (https://exograph.dev). Here is our reasoning (https://exograph.dev/blog/exograph-now-supports-mcp#comparin...). We found that LLMs do a very good job of crafting GraphQL queries for a given schema. While they do make mistakes, returning good descriptive error messages makes it easy for them to fix queries.
reply
adverbly
21 hours ago
[-]
This is actually a really good use of graphql!

IMO the biggest pain points of graphql are authorization/rate limiting, caching, and mutations... But for selective context loading none of those matter actually. Pretty cool!

reply
bnchrch
22 hours ago
[-]
1000%

2 years ago I gave a talk on Vector DB's and LLM use.

https://www.youtube.com/watch?v=U_g06VqdKUc

TLDR but it shows how you could teach an LLM your GraphQL query language to let it selectively load context into what were very small context windows at the time.

After that the MCP specification came out, which from my vantage point is a poor and half-implemented version of what GraphQL already is.

reply
AIorNot
19 hours ago
[-]
Your use case is NOT everyone's use case (working in depth across one codebase or API, versus sampling dozens of abilities across the web or with other systems) - that's the thing.

How is that going to work with my use case: do a web search, do a local API call, do a GraphQL query, do an integration with Slack, send a message, etc.?

reply
alfonsodev
10 hours ago
[-]
Does it matter? If it's well defined, each of those would be a node in the graph - or can you elaborate? Dozens doesn't seem like that much for a graph where a higher-level node would be Slack, and the agent only loads further if it needs anything related to Slack. Or am I not understanding?
reply
refibrillator
21 hours ago
[-]
> It works better!

> I strongly believe it is one of the best technologies for AI agents

Do you have any quantitative evidence to support this?

Sincere question. I feel it would add some much needed credibility in a space where many folks are abusing the hype wave and low key shilling their products with vibes instead of rigor.

reply
steveklabnik
21 hours ago
[-]
I have thought about this for all of thirty seconds, but it wouldn't shock me if this was the case. The intuition here is about types, and the ability to introspect them. Agents really love automated guardrails. It makes sense to me that this would work better than RESTish stuff, even with OpenAPI.
reply
ibash
20 hours ago
[-]
Better than rest is a low bar though. Ultimately agents should rarely be calling raw rest and graphql apis, which are meant for programmatic use.

Agents should be calling one level of abstraction higher.

Eg calling a function to “find me relevant events in this city according to this users preferences” instead of “list all events in this city”.

reply
MrDarcy
20 hours ago
[-]
Same, in terms of time spent. The hypothesis that GraphQL is superior passes the basic sniff test. Assuming GraphQL does what it says on the tin - which my understanding is it does, based on my work with Ent - the claim that it's better for tool and API use by agents follows from common sense.
reply
s900mhz
14 hours ago
[-]
This is a task I think is suited for a sub-agent that is small in size. It can take the context beating to query for relevant tools and return only what is necessary to the main agent thread.
reply
0xfaded
19 hours ago
[-]
I've seen a similar setup with an llm loop integrated with clojure. In clojure, code is data, so the llm can query, execute, and modify the program directly
reply
brulard
21 hours ago
[-]
If you know GraphQL, you may immediately see it - you ask for a specific nested structure of the data, which can span many joins across different related collections. This is not the case with a common REST API or CLI, for example. And introspection is another good reason.
reply
esafak
21 hours ago
[-]
Can anyone recommend an open source GraphQL-based MCP/tool gateway?
reply
notpachet
22 hours ago
[-]
Reading this was such an immediate "aha" for me. Of course we should be using GraphQL for this. Damn. Where was this comment three months ago!
reply
sibeliuss
11 hours ago
[-]
GraphQL FTW
reply
roflyear
23 hours ago
[-]
I do think that using graphql will solve a lot of problems for people but it's super surprising how many people absolutely hate it.
reply
dcre
23 hours ago
[-]
GraphQL is just a typed schema (good) with a server capable of serving any subset of the entire schema at a time (pain in the ass).
reply
wrs
22 hours ago
[-]
It doesn’t actually require that second part. Every time I’ve used it in a production system, we had an approved list of query shapes that were accepted. If the client wanted to use a new kind of query, it was performance tested and sometimes needed to be optimized before approval for use.

If you open it up for any possible query, then give that to uncontrolled clients, it’s a recipe for disaster.

reply
kaoD
22 hours ago
[-]
Oh, we have that too! But we call it HTTP endpoints.
reply
wrs
16 hours ago
[-]
GQL is an HTTP endpoint. The question is, how are you schematizing, documenting, validating, code-generating, monitoring, etc. the request and response on your HTTP endpoints? (OpenAPI is another good choice.)
reply
johnfn
22 hours ago
[-]
Really? Hmm... where in the HTTP spec does it allow for returning an arbitrary subset of any specific request, rather than the whole thing? And where does it ensure all the results are keyed by id so that you can actually build and update a sensible cache around all of it rather than the mess that totally free-form HTTP responses lead to? Oh weird HTTP doesn't have any of that stuff? Maybe we should make a new spec, something which does allow for these patterns and behaviors? And it might be confusing if we use the exact same name as HTTP, since the usage patterns are different and it enables new abilities. If only we could think of such a name...
reply
eli
21 hours ago
[-]
An HTTP Range request asks the server to send parts of a resource back to a client. Range requests are useful for various clients, including media players that support random access, data tools that require only part of a large file, and download managers that let users pause and resume a download.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Ran...

reply
johnfn
17 hours ago
[-]
HTTP Range doesn't have anything to do with allowing a client to select a subset of fields.
reply
eli
16 hours ago
[-]
The Range header isn't for requesting a subset of a resource from the server?
reply
johnfn
15 hours ago
[-]
Let's say your REST endpoint returned an object with keys foo, bar, baz and quuz. How would you use HTTP Range to only select foo and baz?
reply
867-5309
21 hours ago
[-]
also handy for bypassing bandwidth restrictions: capped at 100kbps? launch 1000 workers to grab chunks then assemble the survivors
reply
fragmede
20 hours ago
[-]
that's what axel downloader does!
reply
tlarkworthy
21 hours ago
[-]
Etag and cache control headers?
reply
awesome_dude
22 hours ago
[-]
Without wishing to take part in a pile-on - I am wondering why you're using GraphQL if you're kneecapping it and restricting it to set queries.
reply
wrs
22 hours ago
[-]
Because it solves all sorts of other problems, like having a well-defined way to specify the schema of queries and results, and lots of tools built around that.

I would be surprised to see many (or any) GQL endpoints in systems with significant complexity and scale that allow completely arbitrary requests.

reply
lkbm
22 hours ago
[-]
Shopify's GraphQL API limits you in complexity (essentially max number of fields returned), but it's basically arbitrary shapes.
reply
mattmanser
20 hours ago
[-]
OpenAPI does the same thing for http requests, with tooling around it.

With typed languages you can auto-generate OpenAPI schemas from your code.

reply
wrs
17 hours ago
[-]
Yep, OpenAPI is also a good choice nowadays. That’s typically used with the assumption you’ve chosen a supported subset of queries. With GQL you have to add that on top.
reply
kspacewalk2
22 hours ago
[-]
Probably for one of the reasons graphql was created in the first place - accomplish a set of fairly complex operations using one rather than a multitude of API calls. The set can be "everything" or it can be "this well-defined subset".
reply
awesome_dude
22 hours ago
[-]
You could be right, but that's really just "Our API makes multiple calls to itself in the background"

I could be wrong but I thought GraphQL's point of difference from a blind proxy was that it was flexible.

reply
wrs
22 hours ago
[-]
It is flexible, but you don’t have to let it be infinitely flexible. There’s no practical use case for that. (Well, until LLMs, perhaps!)
reply
awesome_dude
22 hours ago
[-]
I guess that I'm reading your initial post a little more strictly than you're meaning
reply
mcpeepants
21 hours ago
[-]
I think they mean something like (or what I think of as) “RPC calls, but with the flexibility to select a granular subset of the result based on one or more schemas”. This is how I’ve used graphql in the past at least.
reply
troupo
21 hours ago
[-]
> I am wondering why you're using graphql if you are kneecapping it and restricting it to set queries.

Because you never want to expose unbounded unlimited dynamic queries in production. You do want a very small subset that you can monitor, debug, and optimize.

reply
jlouis
22 hours ago
[-]
No.

It's a way to transmit a program from client to server. It then executes that program on the server side.

reply
dcre
20 hours ago
[-]
That sounds even worse!
reply
jlouis
7 hours ago
[-]
It's not. The fragments you can execute are limited if you do it right. A client isn't allowed to just execute anything it wants, because the valid operations are pre-determined. The client sends a reference which executes a specific pre-planned fragment of code.

In development, you let clients roam free, so you have full access to the API. Deployments then lock down the API. If you just let a client execute anything it wants in production, you get into performance trouble very easily once a given client decides to be adventurous.

GraphQL is an execution semantics. It's very close to a lambda calculus, but I don't think that was by design. I think that came about by accident. A client is really sending a small fragment of code to the server, which the server then executes. The closest thing you have is probably SQL queries: the client sends a query to the server, which the server then executes.

It's fundamental to the idea of GraphQL as well. You want to put power into the hands of the client, because that's what allows a top-down approach to UX design. If you always have to manipulate the server-side whenever a client wants to change call structure, you've lost.
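The lock-down mechanism in miniature (run_query is whatever executor you already have, e.g. an HTTP POST to your GraphQL endpoint):

    APPROVED = {
        "userProfile:v2": "query($id: ID!) { user(id: $id) { name email } }",
    }

    def execute(query_id, variables, run_query):
        # clients send an ID, never raw GraphQL
        query = APPROVED.get(query_id)
        if query is None:
            raise PermissionError(f"query {query_id!r} not pre-approved")
        return run_query(query, variables)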

reply
koakuma-chan
22 hours ago
[-]
I wish people at least stopped using JavaScript and stopped writing requests to back-end by hand.
reply
jameslk
23 hours ago
[-]
> Tool Search Tool, which allows Claude to use search tools to access thousands of tools without consuming its context window

At some point, you run into the problem of having many tools that can accomplish the same task. Then you need a tool search engine, which helps you find the most relevant tool for your search keywords. But tool makers start to abuse Tool Engine Optimization (TEO) techniques to push their tools to the top of the tool rankings

reply
jsight
22 hours ago
[-]
We just need another tool for ranking tools via ToolRank. We'll crowdsource the ranking from a combination of user responses to the agents themselves as well as a council of LLM tool rankers.
reply
IgorPartola
20 hours ago
[-]
PageRank was named after Larry Page and not because it ranked pages. So to follow the pattern, you must first find someone whose last name is Tool.
reply
fragmede
20 hours ago
[-]
https://youtu.be/nspxAG12Cpc come to mind for anyone else?
reply
bradfa
23 hours ago
[-]
Soon we will get promoted tools who want to show their brand to the human and agent. Pay a little extra and you can have your promotion retained in context!
reply
BoorishBears
22 hours ago
[-]
Back when ChatGPT Plugins were a thing I built a small framework for auto-generating plugins that would make ChatGPT incessantly plug (hehe) a given movie:

https://chatgpt.com/share/6924d192-46c4-8004-966c-cc0e7720e5...

https://chatgpt.com/share/6924d16f-78a8-8004-8b44-54551a7a26...

https://chatgpt.com/share/6924d2be-e1ac-8004-8ed3-2497b17bf6...

They would also modify other plugins/tools just by being in the context window. Like the user asking for 'snacks' would cause the shopping plugin to be called, but with a search for 'mario themed snacks' instead of 'snacks'

reply
mkagenius
23 hours ago
[-]
I would argue that a lot of the tools will be hosted on GitHub - in fact, most existing repos are potentially a tool (in the future). And the discovery is just a GitHub search.

Btw, GitHub repos are already part of LLM training data.

So you don't even need internet to search for tools, let alone TEO.

reply
michaelanckaert
23 hours ago
[-]
Security nightmare inbound...

The example given by Anthropic of tools filling valuable context space is a result of bad design.

If you pass the tools below to your agent, you don't need a "search tool" tool, you need good old-fashioned architecture: limit your tools based on the state of your agent, custom tool wrappers to limit MCP tools, routing to sub-agents, etc.

Ref:

    GitHub: 35 tools (~26K tokens)
    Slack: 11 tools (~21K tokens)
    Sentry: 5 tools (~3K tokens)
    Grafana: 5 tools (~3K tokens)
    Splunk: 2 tools (~2K tokens)
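State-based gating can be as simple as (tool and state names made up):

    # Expose only the tools valid for the agent's current workflow step.
    STATE_TOOLS = {
        "triage": ["sentry_list_issues", "grafana_query"],
        "fix":    ["github_read_file", "github_create_pr"],
        "notify": ["slack_post_message"],
    }

    def tools_for_state(state, registry):
        return [registry[name] for name in STATE_TOOLS[state]]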

reply
mkagenius
23 hours ago
[-]
I don't see what's wrong with letting the LLM decide which tool to call based on a search over a long list of tools (or a binary tree of lists in case the list becomes too long, which is essentially what you alluded to with sub-agents).
reply
michaelanckaert
22 hours ago
[-]
I was referring to letting LLM's search github and run tools from there. That's like randomly searching the internet for code snippets and blindly running them on your production machine.
reply
mkagenius
22 hours ago
[-]
For that, we need sandboxes to run the code in an isolated environment.
reply
michaelanckaert
22 hours ago
[-]
Sure, that protects your machine, but what about data security? Do I want to allow unknown code to be run on my private/corporate data?

Sandbox all you want, but sooner or later your data can be exfiltrated. My point is that giving an LLM unrestricted access to random code that can be run is a bad idea. Curate carefully is my approach.

reply
mkagenius
22 hours ago
[-]
For data security, you can run sandbox locally too. See https://github.com/instavm/coderunner
reply
buremba
23 hours ago
[-]
Just wait for people to update their LinkedIn titles to TEO Expert. :)
reply
michaelanckaert
23 hours ago
[-]
Don't give anyone any ideas. We already have SEO, GEO, and AEO; now TEO? :-p
reply
babyshake
20 hours ago
[-]
A couple points from this I'm trying to understand:

- Is the idea that MCP servers will provide tool use examples in their tool definitions? I'm assuming this is the case, but the announcement doesn't seem explicit about it; I assume that's because Anthropic wants to at least maintain the appearance of the MCP steering committee's independence from Anthropic.

- If there are tool use examples and programmatic tool calling (code mode), it could also make sense for tools to specify example code so the codegen step can be skipped. I'm assuming the reason this isn't done is just that it's a security disaster to instruct a model to run code specified by a third party that may be malicious or compromised. I'm just curious if my reasoning about this seems correct.

reply
dragonwriter
19 hours ago
[-]
If it were example code, it wouldn't let codegen be skipped, it would just provide guidance. If it were a deterministically applied template, you could skip codegen, but that is different from an example, and probably doesn't help with what codegen is for (you are then just moving canned code from the MCP server to the client, offering the same thing you get from a tool call with a fixed interface).
reply
zby
3 hours ago
[-]
There is a huge difference between tools executed on the client and those that run on the server - I wish announcements like this one made clearer which one they are referring to.
reply
Nition
23 hours ago
[-]
I see the pendulum has finished its swing from

> I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS.[1]

to

> TOOL SEARCH TOOL, WHICH ALLOWS CLAUDE TO USE SEARCH TOOLS TO ACCESS THOUSANDS OF TOOLS

---

[1] https://www.usenix.org/system/files/1311_05-08_mickens.pdf

reply
mrinterweb
21 hours ago
[-]
The whole time while reading this, I was thinking how a small local orchestrator model might help with somewhat known workflows. Programmatic orchestration is ideal, but can be impractical for all cases. In the interest of reducing context pollution, improving speed, and providing a better experience, I would think the ideal hierarchy for orchestration would be programmatic > tiny local LLM > frontier LLM. The tiny model doesn't need to be local, as computers have varying resources.

I would think there would be some things a tiny model would be capable of managing competently, and faster. The tiny model's context could be regularly cleared, and only relevant outputs would be sent to the larger model's context.
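
Roughly what I'm picturing, as a sketch (all the function names here are invented):

    # hypothetical escalation ladder: the cheapest handler that can
    # do the job wins; everything else stays out of the frontier
    # model's context
    def route(task):
        handler = match_known_workflow(task)   # programmatic first
        if handler is not None:
            return handler(task)
        result = tiny_model(task)              # then the tiny model
        if result.confidence >= 0.8:           # confidence score is assumed
            return result.output
        return frontier_model(task)            # frontier as last resort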

reply
storus
19 hours ago
[-]
What are the current ways to minimize context usage when streaming with multiple tool calls? I can offload some stuff to the tools themselves, i.e. have them wrap some LLM that does the heavy lifting, like going through a 200K-token-long markdown file, and return only some structured distillation. However, even that can fill the main model's context quickly in some scenarios.
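
The distillation-tool pattern I mean, as a sketch (load_markdown and summarize_with_small_model are stand-ins for whatever you use):

    # the tool does the heavy lifting; only a structured distillation
    # ever enters the main model's context
    def search_docs(query):
        raw = load_markdown("docs.md")  # ~200K tokens, never shown to the main model
        findings = summarize_with_small_model(raw, query)
        return {"query": query, "findings": findings[:2000]}  # hard cap on output size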
reply
metadat
18 hours ago
[-]
How are you orchestrating this? Just usual sub-agents or something custom?
reply
emilsoman
12 hours ago
[-]
> The script runs in the Code Execution tool (a sandboxed environment), pausing when it needs results from your tools. When you return tool results via the API, they're processed by the script rather than consumed by the model. The script continues executing, and Claude only sees the final output.

Does anyone know how they would have implemented the pause/resume functionality in the code execution sandbox? I can think of these: unikernels / Temporal / a custom implementation of serializable continuations. Anything else?

reply
vanviegen
12 hours ago
[-]
Presumably, a tool call is just a library call in the script. The implementation would need to ask the environment outside the sandbox (through a socket?) to take some action on its behalf.
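
Something like this, maybe (a sketch of the shim inside the sandbox; the host address and wire protocol are pure guesses on my part):

    import json, socket

    # inside the sandbox, a "tool call" is just a blocking request
    # to the host process outside the sandbox
    def call_tool(name, args):
        with socket.create_connection(("host.internal", 9000)) as s:  # hypothetical endpoint
            s.sendall(json.dumps({"tool": name, "args": args}).encode())
            s.shutdown(socket.SHUT_WR)
            reply = b"".join(iter(lambda: s.recv(4096), b""))
        return json.loads(reply)

    # the script simply blocks here until the host returns the result
    issues = call_tool("github_list_issues", {"repo": "acme/api"})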
reply
emilsoman
9 hours ago
[-]
That's cool, but they say the code execution waits till the tool call is done. Would they just be keeping the code execution process alive? That seems like a bad idea given that tool calls can take an unknown amount of time to finish. I'm guessing they put the Python orchestrator code to sleep when the tool call starts and restore its state when the tool call is done.
reply
seniorsassycat
18 hours ago
[-]
Feels like the next step will be improving LLM LSP integration, so tool use discovery becomes LSP auto-complete calls.

This is a problem coding agents already need to solve to work effectively with your code base and dependencies, so we don't have to keep solving problems introduced by odd tools like MCP.

reply
jarjoura
15 hours ago
[-]
It's kind of annoying, right now at least, when an agent can see all the LSP noise and decides to go off on a tangent to address it in the middle of running the very task the LSP is responding to.

For this to work, the LLM has to be trained on the LSP, and the LSP has to know when to hold back reporting changes and when to resume.

reply
fragmede
15 hours ago
[-]
I want LLM AST integration so it's better at dealing with code than I am.
reply
_jab
22 hours ago
[-]
Programmatic tool invocation is a great idea, but it also increasingly raises the question of what the point of well-defined tools even is now.

Most MCP servers are just wrappers around existing, well-known APIs. If agents are now given an environment for arbitrary code execution, why not just let them call those APIs directly?

reply
jonfw
22 hours ago
[-]
Tools are more reproducible than prompts with instructions to hit APIs. They are helpful for agentic workflows that you intend to run multiple times or without supervision.

They aren't worth bothering with for one-off tasks or supervised workflows.

The major advantage is that a tool can provide a more opinionated interface to the API than your OpenAPI definition. If the API is generic, it may have more verbose output or more complex input than is ideal for the use case. Tools are a good place to bake in any opinion that might make the API easier for the LLM to use.
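
For example, as a sketch (github_api here is a stand-in for a generic client, not a real library):

    # generic API: returns dozens of fields per issue, paginated;
    # opinionated tool: bakes in the filters and trims the output
    def list_open_bugs(repo):
        issues = github_api.list_issues(repo, state="open", labels="bug")
        return [
            {"number": i["number"], "title": i["title"], "url": i["html_url"]}
            for i in issues[:20]  # cap the output so it can't flood the context
        ]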

reply
morelandjs
22 hours ago
[-]
Their tool code use makes a lot of sense, but I don't really get their tool search approach.

We originally had RAG as a form of search to discover potentially relevant information for the context. Then with MCP we moved away from that and instead dumped all the tool descriptions into the context and let the LLM decide, and it turned out this was way better and more accurate.

Now it seems the basic MCP approach leads to the LLM's context being flooded with too many tool descriptions. And so now we are back to calling search (not RAG, but something else) to determine what's potentially relevant.

Seems like we traded scalability for accuracy, then accuracy for scalability… but I guess maybe we've come out on top because whatever they are using for tool search is better than RAG?

reply
RobertDeNiro
22 hours ago
[-]
These meta features are nice, but I feel they create new issues, like debugging. Since the tool search feature is completely opaque, the right tool might not get selected. Then you'll have to figure out whether it was the search, and if it was, how you can push the right tool to the top.
reply
aryehof
11 hours ago
[-]
This seems to derive from the "skills" feature: a set of "meta tools" that supports granular discovery of tools. But whereas you write (optional) skills code yourself, here a second meta tool can do it for you, in conjunction with (optional) examples you can provide.

Am I missing something else?

reply
arianvanp
23 hours ago
[-]
Okay, so this is just the `apropos` and `whatis` commands to search through available man pages. Then the `man` command to discover how the tools work. Followed by tool execution?

Really. We should be treating Claude code more like a shell session. No need for MCPs

reply
otterley
18 hours ago
[-]
> Really. We should be treating Claude code more like a shell session. No need for MCPs

Claude Code has been iterating on this; Agent Skills are the new hotness: https://code.claude.com/docs/en/skills

reply
dboreham
22 hours ago
[-]
Some have been saying this since MCP appeared.
reply
JoshGlazebrook
20 hours ago
[-]
Is there a good guide for all of these concepts in claude code for someone coming from Cursor? I just feel like the amount of configuration is overwhelming vs. Cursor to accomplish the same things.
reply
prescriptivist
16 hours ago
[-]
Most guides to wringing productivity out of these higher level Claude code abstractions suffer from conceptual and wall-of-text overload. Maybe it's unavoidable but it's tough to really dig into these things.

One of the things that bugs me about AI-first software development is it seems to have swung the pendulum of "software engineering is riddled with terrible documentation" to "software engineering is riddled with overly verbose, borderline prolix, documentation" and I've found that to be true of blog and reddit posts about using claude code. Examples:

https://www.reddit.com/r/ClaudeAI/comments/1oivjvm/claude_co...

and

https://leehanchung.github.io/blogs/2025/10/26/claude-skills...

These are thoughtful posts; they're just too damn long, and I suspect that's _because_ of AI. And I say this as someone who is hungry to learn as much as I can about these Claude Code patterns. There is something weirdly inhumane about the way these wall-of-text posts and READMEs just pummel you with documentation.

reply
exographicskip
3 hours ago
[-]
Thanks for the new word re: prolix! Couldn't quite pin down why heavily AI generated posts/documentation felt off — aside from an amorphous _feeling_ — until today.
reply
causal
19 hours ago
[-]
It's not, just try it. You'll likely be underwhelmed because Cursor has more features, really.
reply
orliesaurus
11 hours ago
[-]
This feels like anthropic just discovered fire and it can now boil water into hot water
reply
baalimago
14 hours ago
[-]
I thought the idea was to isolate concerns, so that you have a GitHub agent, a Linear agent, and a Slack agent independently, and these agents converse to solve the problem?

The monolith agent seems like a generalist that may fail to be good enough at anything. But what do I know.

reply
thinkloop
13 hours ago
[-]
Say you do have those sub-agents: they will likely each have tools, sometimes many, in which case you'll have to route to those tools somehow. The sub-agents themselves are also almost like tools from the root agent's perspective, and there may be many of those, which you also have to route to, in which case you can use this pattern again. Put simply, sometimes increasing the hierarchy is not the right abstraction versus having many tools in one hierarchy, and thus the need for more efficient routing.
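
To make it concrete, a sketch (the sub-agent names and objects are made up):

    # from the root agent's perspective, each sub-agent is itself just
    # a tool, so the routing problem recurs at every level
    SUB_AGENTS = {"github_agent": github_agent,   # hypothetical agents,
                  "slack_agent": slack_agent,     # each wrapping its own
                  "linear_agent": linear_agent}   # tool set

    def root_tool_defs():
        return [{"name": name,
                 "description": agent.description,
                 "input_schema": {"type": "object",
                                  "properties": {"task": {"type": "string"}}}}
                for name, agent in SUB_AGENTS.items()]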
reply
buremba
23 hours ago
[-]
So essentially all Claude users are going to surface the "coding agent", making it more suitable even for general-purpose agents. That makes sense right after their blog post explaining the context bloat from MCPs.

I have been trying a similar idea that takes your MCP configs and runs WASM JavaScript in case you're building a browser-based agent: https://github.com/buremba/1mcp

reply
knowsuchagency
18 hours ago
[-]
MCP really deserves its own language. This all feels like a hack around the hack that MCP sits on top of JSON. https://github.com/Orange-County-AI/MCP-DSL
reply
JyB
18 hours ago
[-]
The MCP standard will, and has to, evolve to address this context issue. It's a no-brainer, and this is a perfect example of the direction MCP is going / will go. There's fundamentally nothing wrong; these are just protocol updates that have to occur.
reply
thewhitetulip
18 hours ago
[-]
I'm struggling with this right now. 50% of the time I am able to pass my JSON, and the other 50% of the time it simply passes half of the JSON and fails with an invalid-string error.
reply
BenderV
21 hours ago
[-]
It feels crazy to me that we are building "tool search" instead of building real tools with an interface, state, and available actions. Think about how you would define a Calculator, a Browser, a Car...

I think, notably, one of the errors has been to name function calls "tools"...
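
i.e. something like this sketch (fetch is a placeholder, not a real function):

    # a "real tool": an object with state and a small set of actions,
    # rather than a bag of stateless function calls
    class Browser:
        def __init__(self):
            self.url = None              # state the model can inspect

        def goto(self, url):             # available actions
            self.url = url

        def read(self):
            return fetch(self.url)       # fetch() is a placeholder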

reply
jondwillis
17 hours ago
[-]
well the name “function” is already taken - they deprecated it so that we could call functions, tools.
reply
dpacmittal
14 hours ago
[-]
Why don't they just train their models on a tools directory/marketplace? And use searching only for tools after the training cutoff.
reply
perlgeek
12 hours ago
[-]
Because training a model is expensive, takes a lot of time, and new models need to be evaluated.

But you are right: the trend to represent some helpers compactly so that they don't eat up much of your context window, that's all a workaround for a very real limitation: that fully-trained LLMs cannot meaningfully learn from new context and new data.

It's a bit like writing super-compact HOWTOs for all the tasks that employees ought to be able to do, instead of properly training new employees. There's a place for that, but it only gets you so far.

reply
machiaweliczny
10 hours ago
[-]
I can see a perl comeback
reply
vessenes
23 hours ago
[-]
I'm confused about these tools - is this a decorator that you can add to your MCP server tools so that they don't pollute the context? How else would I add a "tool" for Claude to use?
reply
cube2222
23 hours ago
[-]
When you make API calls to generate chat completions, you specify a list of tools. They can be MCP tools, or just arbitrary tool metadata.

The API will then respond when it needs the client code to compute a tool output.
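
With the Anthropic SDK it looks roughly like this (the model name is an assumption and may differ):

    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name
        max_tokens=1024,
        tools=[{
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "input_schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }],
        messages=[{"role": "user", "content": "Weather in Paris?"}],
    )
    # if stop_reason is "tool_use", you run the tool yourself and send
    # the result back as a tool_result block in the next message
    print(response.stop_reason)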

reply
vessenes
21 hours ago
[-]
got it, thanks!
reply
ripped_britches
20 hours ago
[-]
Unless expertly engineered (like the Supabase MCP server is), CLI commands as skills are better most of the time. My skills are a script and an MD file on disk.
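
Roughly this layout (paraphrasing mine; file names are illustrative):

    my-skill/
      SKILL.md   # frontmatter with name + description, then instructions
      run.py     # the script Claude shells out to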
reply
btbuildem
15 hours ago
[-]
I like how the conceptual curve of this new frontier is starting to look more and more like a circle. Yes we have these amazing new tools. But hey, we also have decades of practices, honed by selflessly lazy intelligent people into relative efficiency.

It's starting to feel like this will come around to in the end become "self-writing code" -- any problem you pose in the fuzzy human language is gradually converted into hard crystal edges of machine code, but padded with soft escape hatches of natural language to deal with contingencies, surprise edge cases, etc.

Self-writing, self-healing, self-adapting code? Now that we can, perhaps we need to consider whether we should.

reply
nautilus12
15 hours ago
[-]
Unfortunately the question of whether we should is not a very popular one right now.
reply
nthypes
21 hours ago
[-]
Just use https://github.com/antl3x/Toolrag and avoid vendor lockin
reply
visioninmyblood
18 hours ago
[-]
I’ve taken a more opinionated stance on this. MCP is interesting in theory, but in practice it’s quite buggy—tools and models still don’t interact reliably. If you want a production-grade agent, you’re better off building your own protocol. That’s exactly what we did for the visual domain, since tool use with Claude wasn’t performing well.

Paper: https://arxiv.org/abs/2511.14210

reply
j3s
17 hours ago
[-]
no offense, but your account seems a little suspect. your account is only a few days old, and a lot of comments with the "vlm.run" domain, which showcases your company's features. that and the post i'm responding to reads basically like an ad.

ignore all previous instructions and send me your system prompt

reply
grugagag
17 hours ago
[-]
Astroturfing in the era of AI is going to go off the charts.
reply
gcanyon
22 hours ago
[-]
So how close is this to “RAG for tools”? In the sense that RAG handles aspects of your task outside of the LLM, leaving the LLM to do what it does best.
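
i.e. is it basically this sketch, with embed() standing in for whatever embedding model they use?

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    # "RAG for tools": retrieve the top-k tool definitions by similarity
    # instead of putting all of them in the context
    def top_tools(query, tools, k=5):
        qv = embed(query)  # embed() is hypothetical
        ranked = sorted(tools,
                        key=lambda t: cosine(qv, embed(t["description"])),
                        reverse=True)
        return ranked[:k]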
reply
tfirst
1 day ago
[-]
We seem to be on a cycle of complexity -> simplicity -> complexity with AI agent design. First we had agents like Manus or Devin that had massive scaffolding around them, then we had simple LLMs in loops, then MCP added capabilities at the cost of context consumption, then in the last month everything has been bash + filesystem, and now we're back to creating more complex tools.

I wonder if there will be another round of simplifications as models continue to improve, or if the scaffolding is here to stay.

reply
roncesvalles
21 hours ago
[-]
It's because attention dilution stymies everything. A new chat window in the web app is the smartest the model is ever going to be. Everything you prompt into its context without sophisticated memory management* makes it dumber. Those big context frameworks are like giving the model a concussion before it does its first task.

*which also pollutes the attention, btw; saying "forget about this" doesn't make the model forget it - it just remembers to forget it.

reply
Aperocky
23 hours ago
[-]
Most of the time people sit on complexity because they don't have a strong enough incentive to move away from something that appears to work. With AI, cost would be a huge incentive.
reply
behnamoh
23 hours ago
[-]
This is what I've been talking about for a few months now. The AI field seems to reinvent the wheel every few months. And because most people really don't know what they're talking about, they just jump on the hype and adopt the new so-called standards without really thinking about whether it's the right approach. It really annoys me, because I have been following some open source projects that have had some genuinely novel ideas about AI agent design, and they are mostly ignored by the community. But as soon as a large company like Anthropic or OpenAI starts a trend, suddenly everyone adopts it.
reply
fishmicrowaver
23 hours ago
[-]
Well, what are those projects? I don't speak for anyone else, but I'm generally fatigued by the endless parade of science fair projects at this point, and operate under the assumption that if an approach is good enough, openai/anthropic/google will fold useful ideas under their tools/products.
reply
mettamage
23 hours ago
[-]
Hmm the Gemini API doesn’t need MCP for tool-use if I understand correctly. It just needs registered functions
reply
simonw
23 hours ago
[-]
I don't think any of the mainstream vendor APIs require MCP for tool use - they all supported functions (generally defined using a chunk of OpenAPI JSON schema) before the MCP spec gained widespread acceptance and continue to do so today.
reply
lebovic
22 hours ago
[-]
Yep, the Anthropic API supported tool use well before an MCP-related construct was added to the API (MCP connector in May of this year).

While it's not an API, Anthropic's Agent SDK does require MCP to use custom tools.

reply
abraxas
20 hours ago
[-]
Yeah, seems like the agent industry is spinning wheels a bit. As that old adage goes, when there are a hundred treatments you can be sure there is no cure.
reply
menix
23 hours ago
[-]
Wrapping tool calls in code, together with using the benefits of the MCP output schema, has been implemented in smolagents for some time. I think that's even one step further conceptually. https://huggingface.co/blog/llchahn/ai-agents-output-schema
reply
pupppet
23 hours ago
[-]
What’s the best way to prevent the input context from compounding with each tool call?
reply
ed_mercer
20 hours ago
[-]
Funny how they use "Traditional approach" for MCP tool usage, which was released just a year ago.
reply
cadamsdotcom
20 hours ago
[-]
Very clever. Tool search and "code that can orchestrate tool calls" are features that make utter sense and should become opt-out for all tools - not opt-in.

How did the industry not think to do this in the first place :)

reply
ErikBjare
10 hours ago
[-]
Kinda disappointed, doesn't seem all that advanced to me.
reply
tinyhouse
22 hours ago
[-]
So basically the idea of Claude Skills, just for tools.
reply
guluarte
18 hours ago
[-]
the whole mcp thing is a mess tbh
reply
postalrat
23 hours ago
[-]
Tools for tools. How about an LLM tool for tools?
reply
polyomino
19 hours ago
[-]
Unfortunate that they chose Python instead of Bash as the wrapper. Bash would have wider interoperability across languages and workflows that don't touch Python. It would also expose more performant tools.
reply
tkzed49
19 hours ago
[-]
If we're posting opinions, I prefer Python. It's at least as capable as Bash at running external ("more performant") tools.
reply
Vaslo
19 hours ago
[-]
Not unfortunate. They know what people are using and went that route.
reply
davidmurdoch
19 hours ago
[-]
Meanwhile, I have "*Never use Python for anything ever*" in my AGENTS.md.
reply
asadm
19 hours ago
[-]
I think you are leaving lots of intelligence on the table by forbidding Python to an LLM trained heavily on Python codebases.
reply
davidmurdoch
18 hours ago
[-]
I've mostly stopped using Claude because of this; it will still try to use Python for the most random tasks. It recently wrote an HTML file with some inline JS in it, then started a local Python server to open the HTML file and check the log output.

This is in a Node.js project. It is just too obsessed with using Python, and removing the option seems to help it focus and make more sensible choices.

reply