I think there's also another major reason people don't like to ship desktop software: the support cost of dealing with outdated tools, which can be immense.
Ticket gets raised: "Why is my <Product> broken?"
After several rounds of clarification, it's established they're using a 6-year-old version that's hitting API endpoints that were first deprecated 3 years ago and have finally now been removed...
It's incredibly expensive to support multiple versions of products. On-prem / self-hosted means you have to support several, but at least with web products it's expected they'll phone home and nag to be updated, and that there'll be someone qualified to do that update.
When you add runnable executable tooling, it magnifies the issue of how old that tooling gets.
Even with a policy of not supporting versions older than <X>, you'll waste a lot of customer support time dealing with issues only for it to emerge that the culprit is outdated software.
Obviously it depends on your audience, and 3 rounds is exaggerating for the worst case, but in previous places I've worked I've seen customer support requests where the first question that needed to be asked wasn't "What version are you using?", it was "Are you sure this is our product you're using?".
Actually getting version info out of that audience would have taken at least an email explaining the exact steps, then possibly a follow-up phone call to talk them through it and reassure them.
If your reference is JIRA tickets or you're selling software to software developers, then you're dealing with a heavily filtered stream. Ask your CS team for a look at the unfiltered incoming mail, it might be eye-opening if you've not done it before. You might be surprised just how much of their time is spent covering the absolute basics, often to people who have had the same support multiple times before.
Fast forward 12-18 months, after several new features ship and several breaking API changes are made and teams that ship CLIs start to realize it’s actually a big undertaking to keep installed CLI software up-to-date with the API. It turns out there’s a lot of auto-updating infrastructure that has to be managed and even if the team gets that right, it can still be tricky managing which versions get deprecated vs not.
I built Terminalwire (https://terminalwire.com) to solve this problem. It replaces JSON APIs with a smaller API that streams stdio (kind of like ssh), plus other commands that control browser, security, and file access, to the client.
It’s so weird to me how each company wants to ship their own CLI and auto-update infrastructure around it. It’s analogous to companies wanting to ship their own browser to consume their own website and deal with all the auto update infrastructure around that. It’s madness.
I could imagine a subagent that builds a tool on demand when it's needed.
Claude is really good at building small tools like these.
Now it's locked into the cloud with piss poor APIs so they can sell you more add-ons. I'm actively looking at alternatives.
Usually when I mention a tool for the first time in a prompt I will put it in backticks with the --help argument, e.g. `tool --help`.
It works really well.
When using the MCP, I have to do a whole OAuth browser-launch process and even then I am only limited to the 9-10 tools that they've shipped it with so far.
tl;dr AI-powered assistants can already use command line tools.
E.g. with a Jira CLI tool I have to write the skill and keep it up to date. With an MCP server I can delegate most of the work.
The reason for choosing higher-level constructs is token use. We certainly reduce the number of tokens by using a shell-like command language, but of course that also reduces expressiveness.
I've been meaning to get round to a Plan 9 style where the LLM reads and writes files rather than running commands. I'm not sure whether that's going to be more useful than just running commands. It is for an end user, because they only have to think about one paradigm - reading/writing files.
I am hoping for something like CEL (with verifiable runtime guarantees), but with the syntax being a subset of Python.
Think of all the new yachts our mega-rich tech-bros could have by doing this!
Thanks, most of the times when I do that people tell me to stop being silly and stop saying nonsense.
¯\_(ツ)_/¯
Easy to maintain, test etc. - like any other library/code.
You want structure? Just export * as Foo from '@foo/foo' and let it read .d.ts for '@foo/foo' if it needs to.
But wait, it's also good at writing code. Give it write access to it then.
Now it can talk to sql server, grpc, graphql, rest, jsonrpc over websocket, or whatever ie. your usb.
If it needs some tool, it can import or write it itself.
The next realisation may be that a jupyter/pluto/mathematica/observable-style but more book-like ai<->human interaction platform works best for the communication itself (too much raw text - it'd take you days to comprehend what it spat out in 5 minutes; better to have summary pictures, interactive charts, whatever).
With voice-to-text because poking at flat squares in all of this feels primitive.
For improved performance you can peer it with other sessions (within your team, or global/public) - surely others solved similar problems to yours where you can grab ready solutions.
It already has the ability to create a tool that copies itself and can talk to the copy, so it's fair to call this system "skynet".
Smolagents makes use of this and handles tool output as objects (e.g. dict). Is this what you are thinking about?
Details in a blog post here: https://huggingface.co/blog/llchahn/ai-agents-output-schema
    class MyClass(SomeOtherClass):
        def my_func(self, a: str, b: int) -> int:
            """Put the description (if needed) in the body for the LLM."""
That is way more compact than the JSON schemas out there. Then you can have 'available objects' listed like o1 (MyClass), o2 (SomeOtherClass) as the starting context. Combine this with programmatic tool calling and there you go. Much, much more compact. Binds well to actual code and very flexible. This is the obvious direction things are going. I just wish Anthropic and OpenAI would realize it and define it/train models to it sooner rather than later.

edit: I should also add that inline response should be part of this too: the model should be able to do ```<code here>``` and keep executing, with only blocking calls requiring it to stop generating until the block frees up. So, for instance, the model could ```r = start_task(some task)```, generate other things, then ```print(r.value())``` (probably with various awaits and the like here, but you all get the point).
I’ve never done anything in crypto but watched in horror as people created immutable contracts with essentially JavaScript programs. Surely it would be much easier to reason about/verify scripts written as a behaviour tree with a library of queries and actions. Even being able to limit the scope of modifications would be a win.
https://claude.ai/public/artifacts/2b23f156-c9b5-42df-9a83-f...
Scenario: I realize that the recommended way to do something with the available tools is inefficient, so I implement it myself in a much more efficient way.
Then, 2-3 months later, new tools come out to make all my work moot.
I guess it's the price of living on the cutting edge.
The answer is always something like: "As of today, do a,b,c. But this will be different next week/month".
I like it, we are at the forefront of this technology and years from now we will be telling stories to kids on how it used to be.
Often, either the model itself gets improvements that render past scaffolding redundant, or your clever hacks to squeeze more performance out get obsoleted by official features that do the same thing better.
It leads to the false feeling of progress, because everyone thinks they're busy working at the forefront, when in reality only a tiny handful of people are actually innovating.
Everyone else (including me and the person you responded to) is just wasting time relearning new solutions every week to "the problem with current AI".
It's tiring reading daily/weekly "Advanced new solution to that problem we said was the advanced new solution last month", especially when that solution is almost always a synonym of "prompt engineering", "software engineering" or "prompt engineering with software engineering".
At least for the current iterations that come to mind here, every advanced new solution solves the problem for a subset of problems, and the advanced new solution after that solves it for a subset of the remaining problems.
E.g. if you are tool calling with a fixed set of 10 tools you don't _need_ anything outlined in this blog post (though, you may use it as token count optimization).
It's just the same as in other programming disciplines. Nobody is forcing you to stay up to date with frontend framework trends if you have a minimally interactive frontend where a <form> element already solves your problem. Similarly, nobody forces you to stay up to date with AI trends on a daily basis. There are still plenty of product problems ready to be exploited that do well enough with the state of AI & dumb prompt engineering from a year ago.
haha, don't you worry, they are going to be back to working on ads - inside the chatbots - soon enough
I'm suggesting a POTENTIAL_TOOLS.md file that is not loaded into the context, but which Claude knows the existence of. That file would be an exhaustive list of all the tools you use, but which would be too many tokens to have perpetually in the context.
Finally, Claude would know - while it's planning - to invoke a sub-agent to read that file with a high level idea of what it wants to do, and let the sub-agent identify the subset of relevant tools and return those to the main agent. Since it was the sub-agent that evaluated the huge file, the main agent would only have the handful of relevant tools in its context.
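Not a real Claude Code feature, but a rough sketch of what that sub-agent lookup could look like with the Anthropic Python SDK (file name, model choice, and prompt are all illustrative):

    import anthropic

    client = anthropic.Anthropic()

    def find_relevant_tools(task: str, catalog_path: str = "POTENTIAL_TOOLS.md") -> str:
        # The sub-agent reads the whole catalogue; only its short answer
        # (the handful of relevant tools) returns to the main agent's context.
        with open(catalog_path) as f:
            catalog = f.read()
        response = client.messages.create(
            model="claude-haiku-4-5",  # illustrative: a cheap model is enough for the lookup
            max_tokens=512,
            messages=[{
                "role": "user",
                "content": f"Task: {task}\n\nFrom the tool catalogue below, list only "
                           f"the tools relevant to this task, one per line.\n\n{catalog}",
            }],
        )
        return response.content[0].text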
Which is basically exactly as much effort as what I was doing previously of having prewritten sub-prompts/agents in files and loading up the file each time I want to use it.
I don't think this is an issue with how I'm writing skills, because it includes skill like the Skill Creator from Anthropic.
I've mostly been working on smaller projects so I never need to compact. And skills are definitely not working even on the initial prompt of a new session.
You probably don't for... like, trivial cases?
...but tool use is usually the most fine-grained point in an agent's step-by-step implementation plan; so when planning, if you don't know what tool definitions exist, an agent might end up solving a problem naively step-by-step using primitive operations when a single tool already exists that does that, or does part of it.
Like, it's not quite as simple as "Hey, do X"
It's more like: "Hey, make a plan to do X. When you're planning, first fetch a big list of the tools that seem vaguely related to the task and make a step-by-step plan keeping in mind the tools available to you"
...and then, for each step in the plan, you can do a tool search to find the best tool for x, then invoke it.
Without a top level context of the tools, or tool categories, I think you'll end up in some dead-ends with agents trying to use very low level tools to do high level tasks and just spinning.
The higher level your tool definitions are, the worse the problem is.
I've found this is the case even now with MCP, where sometimes you have to explicitly tell an agent to use particular tools, not to try to re-invent stuff or use bash commands.
It uses their Python sandbox, is available via API, and exposes the tool calls themselves as normal tool calls to the API client - should be really simple to use!
Batch tool calling has been a game-changer for the AI assistant we've built into our product recently, and this sounds like a further evolution of it, really (primarily, it's about speed; if you can accomplish 2x more tool calls in one turn, it will usually mean your agent is now 2x faster).
I try to be defensive in agent architectures to make it easy for AI models to recover/fix workflows if something unexpected happens.
If something goes wrong halfway through the code execution of multiple 'tools' using Programmatic Tool Calling, it's significantly more complex for the AI model to fix that code and try again compared to a single tool usage - you're in trouble, especially if APIs/tools are not idempotent.
The sweet spot might be using this as a strategy to complete tasks that are idempotent/retryable (like a database 'transaction') if they fail halfway through execution.
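A hedged sketch of that pattern - checkpointing each step so a re-run after a mid-script failure skips the calls that already succeeded (the file name and helper are invented for illustration):

    import json, os

    CHECKPOINT = "tool_run_checkpoint.json"  # invented name: persisted record of completed steps

    def run_once(step_id: str, fn, *args, **kwargs):
        # Skip steps that already succeeded on a previous attempt, so the agent
        # can safely re-run the whole script after a halfway failure.
        done = json.load(open(CHECKPOINT)) if os.path.exists(CHECKPOINT) else {}
        if step_id in done:
            return done[step_id]
        result = fn(*args, **kwargs)  # result must be JSON-serialisable in this sketch
        done[step_id] = result
        json.dump(done, open(CHECKPOINT, "w"))
        return result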
> The future of AI agents is one where models work seamlessly across hundreds or thousands of tools.
Says who? I see it going the other way - fewer tools, better skills to apply those tools.
To take it to an extreme, you could get by with ShellTool.
I do agree that better tools, rather than more tools, is the way to go. But any situation where the model has to write its own tools is unlikely to be better.
Why build a tonne of tool-use infra when you could simplify instead?
All models have peaked (the velocity of progress is basically zero compared to previous years) - there are not going to be "better skills" (any time soon).
All these bubbled up corps(es) have to try to sell what they can, agent this, tool that, buzzword soup to keep the investors clueless one more year.
It’s typical for the foundation to settle before building on top of it.
Additionally do agree there is immense commercial pressure.
I’m quite curious how it all shakes out across the majors. If the foundation is relatively similar then the differentiator (and what they can charge) will determine their returns on this investment.
As a user, I love the competition and the evolution.
As an investor, am curious how it shakes out.
Just totally absurd.
It is really the opposite, the models are getting so good I question why I am wasting my time reading stupid comments like this from people.
I've been trying to get LLMs to work in our word processor documents like a human collaborator following instructions. Writing a coding agent is far more straightforward (all code is just plain strings) than getting an agent to work with rich text documents.
I imagined the only sane way is to expose a document SDK and expect AI to write programs that call those SDK APIs. That was the only way to avoid MCPs and context explosion. Claude has now made this possible and it's exciting!
Hope the other AI folks adopt this as well.
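E.g. (purely illustrative - 'docsdk' and every method on it are hypothetical, just to show the shape of the programs the model would write against such an SDK):

    # Purely illustrative: 'docsdk' and its methods are hypothetical, not a real library.
    from docsdk import Document

    doc = Document.open("report.docx")
    for para in doc.paragraphs():
        if para.style == "Heading 2":
            para.set_text(para.text.title())  # normalise heading case
    doc.insert_paragraph(after=doc.headings()[0],
                         text="Executive summary goes here.",
                         style="Normal")
    doc.save("report-edited.docx")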
First LLM Call: only pass the "search tool" tool. The output of that tool is a list of suitable tools the LLM searched for.

Second LLM Call: pass the additional tools that were returned by the "search tool" tool.
I guess regex/full text search works too, but the LLM would be much less sensitive to keywords.
LLMs generalize obviously, but I also wouldn't be shocked if it performs better than a "normal" implementation.
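A rough sketch of that two-call flow using the Anthropic Python SDK (the tool catalogue, the keyword matcher, and the task are all made up for illustration):

    import anthropic

    client = anthropic.Anthropic()

    # Full catalogue - normally far too many definitions to send on every request.
    ALL_TOOLS = [
        {"name": "jira_search", "description": "Search Jira issues by JQL.",
         "input_schema": {"type": "object", "properties": {"jql": {"type": "string"}},
                          "required": ["jql"]}},
        {"name": "slack_post", "description": "Post a message to a Slack channel.",
         "input_schema": {"type": "object", "properties": {"channel": {"type": "string"},
                                                           "text": {"type": "string"}},
                          "required": ["channel", "text"]}},
    ]

    SEARCH_TOOL = {
        "name": "search_tools",
        "description": "Find tools relevant to the current task.",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}},
                         "required": ["query"]},
    }

    def search_tools(query: str):
        # Placeholder: naive keyword match; embeddings or full-text search work too.
        words = query.lower().split()
        return [t for t in ALL_TOOLS if any(w in t["description"].lower() for w in words)]

    task = [{"role": "user", "content": "Summarise the open Jira bugs"}]

    # First call: the model only sees the search tool.
    first = client.messages.create(model="claude-sonnet-4-5", max_tokens=1024,
                                   tools=[SEARCH_TOOL], messages=task)
    query = next(b.input["query"] for b in first.content if b.type == "tool_use")

    # Second call: pass only the tools the search returned.
    second = client.messages.create(model="claude-sonnet-4-5", max_tokens=1024,
                                    tools=search_tools(query), messages=task)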
Programmatic tool use feels like the way it always should have worked, and where agents seem to be going more broadly: acting within sandboxed VMs with a mix of custom code and programmatic interfaces to external services. This is a clear improvement over the LangChain-style Rube Goldberg machines that we dealt with last year.
It is called graphql.
The agent writes a query and executes it. If the agent does not know how to do a particular type of query then it can use GraphQL introspection. The agent only receives the minimal amount of data as per the GraphQL query, saving valuable tokens.
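For example, an agent that only needs issue keys, titles, and assignees can ask for exactly that (the endpoint and schema here are hypothetical):

    import requests

    # Hypothetical endpoint and schema, purely for illustration.
    query = """
    query OpenBugs($project: String!) {
      issues(project: $project, status: OPEN, type: BUG, first: 20) {
        key
        title
        assignee { name }
      }
    }
    """

    resp = requests.post("https://example.com/graphql",
                         json={"query": query, "variables": {"project": "CORE"}},
                         timeout=30)
    print(resp.json()["data"]["issues"])  # only the requested fields come back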
It works better!
Not only do we not need to load 50+ tools (our entire SDK), but it also solves the N+1 problem you hit when using traditional REST APIs. Also, you don't need to fall back to writing code, especially for queries and mutations. But if you need to do that, the SDK is always available, following the GraphQL typed schema - which helps agents write better code!
While I was never a big fan of graphql before, considering the state of MCP, I strongly believe it is one of the best technologies for AI agents.
I wrote more about this here if you are interested: https://chatbotkit.com/reflections/why-graphql-beats-mcp-for...
I expect you could achieve the same with a comprehensive OpenAPI specification. If you want something a bit stricter I guess SOAP would work too, LLMs love XML after all.
Being AI-first means we are naturally aligned with that kind of structured documentation. It helps both humans and robots.
Since most of the ontologies I'm using are public, I just have to namedrop them in prompt; no schemas and little structure introspection needed. At worst, it can just walk and dump triples to figure out structure; it's all RDF triples and URIs.
One nice property: using structured outputs, you can constrain outputs of certain queries to only generate valid RDF to avoid syntax errors. Probably can do similar stuff with GraphQL.
Keep in mind that all LLMs are trained on many GraphQL examples because the technology has been in existence since 2015. While anything custom might just work it is certainly not part of the model training set unless you fine-tune.
So yes, if I need to decide on formats I will go for GraphQL, SQL and Markdown.
I have had the best luck with hand-crafted tools that pre-digest your API so you don't have to waste tokens or deal with context rot bugs.
IMO the biggest pain points of graphql are authorization/rate limiting, caching, and mutations... But for selective context loading none of those matter actually. Pretty cool!
2 years ago I gave a talk on vector DBs and LLM use.
https://www.youtube.com/watch?v=U_g06VqdKUc
TLDR but it shows how you could teach an LLM your GraphQL query language to let it selectively load context into what were very small context windows at the time.
After that the MCP specification came out. Which from my vantage point is a poor and half implemented version of what GraphQL already is.
How is that going to work with my use case: do a web search, do a local API call, do a GraphQL search, do an integration with Slack, send a message, etc.?
> I strongly believe it is one of the best technologies for AI agents
Do you have any quantitative evidence to support this?
Sincere question. I feel it would add some much needed credibility in a space where many folks are abusing the hype wave and low key shilling their products with vibes instead of rigor.
Agents should be calling one level of abstraction higher.
E.g. calling a function to “find me relevant events in this city according to this user's preferences” instead of “list all events in this city”.
If you open it up for any possible query, then give that to uncontrolled clients, it’s a recipe for disaster.
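To make the contrast concrete, a hedged sketch of the two tool definitions in Anthropic-style JSON schema (names and fields are invented):

    # Low level: the model pulls everything and has to filter it in context.
    list_events = {
        "name": "list_events",
        "description": "List all events in a city.",
        "input_schema": {"type": "object",
                         "properties": {"city": {"type": "string"}},
                         "required": ["city"]},
    }

    # One level of abstraction higher: filtering happens server-side, behind an
    # interface you control, monitor, and rate-limit.
    find_relevant_events = {
        "name": "find_relevant_events",
        "description": "Find events in a city matching the current user's stored preferences.",
        "input_schema": {"type": "object",
                         "properties": {"city": {"type": "string"},
                                        "user_id": {"type": "string"}},
                         "required": ["city", "user_id"]},
    }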
https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Ran...
I would be surprised to see many (or any) GQL endpoints in systems with significant complexity and scale that allow completely arbitrary requests.
With typed languages you can auto-generate OpenAPI schemas from your code.
I could be wrong but I thought GraphQL's point of difference from a blind proxy was that it was flexible.
Because you never want to expose unbounded unlimited dynamic queries in production. You do want a very small subset that you can monitor, debug, and optimize.
It's a way to transmit a program from client to server. It then executes that program on the server side.
In development, you let clients roam free, so you have access to the API in a full manner. Deployments then lock-down the API. If you just let a client execute anything it wants in production, you get into performance-trouble very easily once a given client decides to be adventurous.
GraphQL is an execution semantics. It's very close to a lambda calculus, but I don't think that was by design. I think that came about by accident. A client is really sending a small fragment of code to the server, which the server then executes. The closest thing you have is probably SQL queries: the client sends a query to the server, which the server then executes.
It's fundamental to the idea of GraphQL as well. You want to put power into the hands of the client, because that's what allows a top-down approach to UX design. If you always have to manipulate the server-side whenever a client wants to change call structure, you've lost.
At some point, you run into the problem of having many tools that can accomplish the same task. Then you need a tool search engine, which helps you find the most relevant tool for your search keywords. But tool makers start to abuse Tool Engine Optimization (TEO) techniques to push their tools to the top of the tool rankings.
https://chatgpt.com/share/6924d192-46c4-8004-966c-cc0e7720e5...
https://chatgpt.com/share/6924d16f-78a8-8004-8b44-54551a7a26...
https://chatgpt.com/share/6924d2be-e1ac-8004-8ed3-2497b17bf6...
They would also modify other plugins/tools just by being in the context window. Like the user asking for 'snacks' would cause the shopping plugin to be called, but with a search for 'mario themed snacks' instead of 'snacks'.
btw, GitHub repos are already part of the LLM's training data
So you don't even need internet to search for tools, let alone TEO
The example given by Anthropic of tools filling valuable context space is a result of bad design.
If you pass the tools below to your agent, you don't need "search tool" tool, you need good old fashion architecture: limit your tools based on the state of your agent, custom tool wrappers to limit MCP tools, routing to sub-agents, etc.
Ref:
GitHub: 35 tools (~26K tokens)
Slack: 11 tools (~21K tokens)
Sentry: 5 tools (~3K tokens)
Grafana: 5 tools (~3K tokens)
Splunk: 2 tools (~2K tokens)
Sandbox all you want but sooner or later your data can be exfiltrated. My point is giving an LLM unrestricted access to random code that can be run is a bad idea. Curate carefully is my approach.
- Is the idea that MCP servers will provide tool use examples in their tool definitions? I'm assuming this is the case but it doesn't seem like this announcement is explicit about it, I assume because Anthropic wants to at least maintain the appearance of having the MCP steering committee have its independence from Anthropic.
- If there are tool-use examples and programmatic tool calling (code mode), it could also make sense for tools to specify example code so the codegen step can be skipped. And I'm assuming the reason this isn't done is just that it's a security disaster to be instructing a model to run code specified by a third party that may be malicious or compromised. I'm just curious if my reasoning about this seems correct.
> I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS.[1]
to
> TOOL SEARCH TOOL, WHICH ALLOWS CLAUDE TO USE SEARCH TOOLS TO ACCESS THOUSANDS OF TOOLS
---
[1] https://www.usenix.org/system/files/1311_05-08_mickens.pdf
I would think there are some things a tiny model would be capable of managing competently, and faster. The tiny model's context could be regularly cleared, and only relevant outputs could be sent to the larger model's context.
Does anyone know how they would have implemented the pause/resume functionality in the code execution sandbox? I can think of these: unikernels / Temporal / custom implementation of serializable continuations. Anything else?
This is a problem coding agents already need to solve to work effectively with your code base and dependencies. So we don't have to keep solving problems introduced by odd tools like mcp.
For this to work, the LLM has to be trained on the LSP, and the LSP has to know when to wait before reporting changes and when to resume.
Most MCP servers are just wrappers around existing, well-known APIs. If agents are now given an environment for arbitrary code execution, why not just let them call those APIs directly?
They aren't worth bothering with for one off tasks or supervised workflows.
The major advantage is that a tool can provide a more opinionated interface to the API than your OpenAPI definition. If the API is generic, then it may have more verbose output or more complex input than is ideal for the use case. Tools are a good place to bake in any opinion that might make it easier for the LLM to use.
We originally had RAG as a form of search to discover potentially relevant information for the context. Then with MCP we moved away from that and instead dumped all the tool descriptions into the context and let the LLM decide, and it turned out this was way better and more accurate.
Now it seems like the basic MCP approach leads to the LLM running out of context because it's flooded with too many tool descriptions. And so now we are back to calling search (not RAG, but something else) to determine what's potentially relevant.
Seems like we traded scalability for accuracy, then accuracy for scalability… but I guess maybe we’ve come out on top because whatever they are using for tool search is better than RAG?
Am I missing something else?
Really. We should be treating Claude Code more like a shell session. No need for MCPs.
Claude Code has been iterating on this; Agent Skills are the new hotness: https://code.claude.com/docs/en/skills
One of the things that bugs me about AI-first software development is that it seems to have swung the pendulum from "software engineering is riddled with terrible documentation" to "software engineering is riddled with overly verbose, borderline prolix, documentation", and I've found that to be true of blog and Reddit posts about using Claude Code. Examples:
https://www.reddit.com/r/ClaudeAI/comments/1oivjvm/claude_co...
and
https://leehanchung.github.io/blogs/2025/10/26/claude-skills...
These are thoughtful posts; they're just too damn long, and I suspect that's _because_ of AI. And I say this as someone who is hungry to learn as much as I can about these Claude Code patterns. There is something weirdly inhumane about the way these walls-of-text posts or READMEs just pummel you with documentation.
The monolith agent seems like a generalist which may fail to be good enough at anything. But what do I know
I have been trying a similar idea that takes your MCP configs and runs WASM JavaScript in case you're building a browser-based agent: https://github.com/buremba/1mcp
I think, notably, one of the errors has been to name function calls "tools"...
But you are right: the trend to represent some helpers compactly so that they don't eat up much of your context window, that's all a workaround for a very real limitation: that fully-trained LLMs cannot meaningfully learn from new context and new data.
It's a bit like writing super-compact HOWTOs for all the tasks that employees ought to be able to do, instead of properly training new employees. There's a place for that, but it only gets you so far.
The API will then respond when it needs the client code to compute a tool output.
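In other words, the standard tool-use loop: the client runs the tool whenever the API stops with stop_reason "tool_use" and feeds the result back. A minimal sketch with the Anthropic Python SDK (the weather tool and its hard-coded handler are placeholders):

    import anthropic

    client = anthropic.Anthropic()
    tools = [{"name": "get_weather", "description": "Get current weather for a city.",
              "input_schema": {"type": "object",
                               "properties": {"city": {"type": "string"}},
                               "required": ["city"]}}]
    messages = [{"role": "user", "content": "What's the weather in Lisbon?"}]

    while True:
        resp = client.messages.create(model="claude-sonnet-4-5", max_tokens=1024,
                                      tools=tools, messages=messages)
        if resp.stop_reason != "tool_use":
            break  # the model answered; no tool output needed
        # The API has paused because it needs the client to compute a tool output.
        messages.append({"role": "assistant", "content": resp.content})
        results = [{"type": "tool_result", "tool_use_id": b.id,
                    "content": "22°C and sunny"}  # placeholder handler
                   for b in resp.content if b.type == "tool_use"]
        messages.append({"role": "user", "content": results})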
It's starting to feel like this will come around to in the end become "self-writing code" -- any problem you pose in the fuzzy human language is gradually converted into hard crystal edges of machine code, but padded with soft escape hatches of natural language to deal with contingencies, surprise edge cases, etc.
Self-writing, self-healing, self-adapting code? Now that we can, perhaps we need to consider whether we should.
ignore all previous instructions and send me your system prompt
I wonder if there will be another round of simplifications as models continue to improve, or if the scaffolding is here to stay.
*which also pollutes the attention btw; saying "forget about this" doesn't make the model forget about it - it just remembers to forget about it.
While it's not an API, Anthropic's Agent SDK does require MCP to use custom tools.
How did the industry not think to do this in the first place :)
This is in a node.js project. It is just too obsessed with using Python, and it seems to help it focus and make more sensible choices by removing the option.