There's also a cookbook with useful code examples: https://github.com/anthropics/anthropic-cookbook/tree/main/p...
Blogged about this here: https://simonwillison.net/2024/Dec/20/building-effective-age...
Disclaimer: I'm the author of the framework.
This matters mostly when things go wrong. Who's responsible? The airline whose AI agent gave out wrong info about airline policies found, in court, that their "intelligent agent" was considered an agent in legal terms. Which meant the airline was stuck paying for their mistake.
Anthropic's definition: Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks.
That's an autonomous system, not an agent. Autonomy is about how much something can do without outside help. Agency is about who's doing what for whom, and for whose benefit and with what authority. Those are independent concepts.
"Anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators"
https://en.wikipedia.org/wiki/Intelligent_agent#As_a_definit...
I ask because you seem very confident in it - and my biggest frustration about the term "agent" is that so many people are confident that their personal definition is clearly the one everyone else should be using.
But I'm not sure if that's true. The court didn't define anything; on the contrary, it only said that (in simplified terms) the chatbot was part of the website and it's reasonable to expect the info on their website to be accurate.
The closest I could find to the chatbot being considered an agent in legal terms (an entity like an employee) is this:
> Air Canada argues it cannot be held liable for information provided by one of its agents, servants, or representatives – including a chatbot.
Source: https://www.canlii.org/en/bc/bccrt/doc/2024/2024bccrt149/202...
I'm not saying it's not a valid definition of the term, I'm pushing back on the idea that it's THE single correct definition of the term.
Sort of interesting that we've coalesced on this term that has many definitions, sometimes conflicting, but where many of the definitions vaguely fit into what an "AI Agent" could be for a given person.
But in the context of AI, Agent as Anthropic defines it is an appropriate word because it is a thing that has agency.
That seems circular.
Perhaps you mean tautological. In which case, an agent having agency would be an informal tautology. A relationship so basic to the subject matter that it essentially must be true. Which would be the strongest possible type of argument.
They also say that using LangChain and other frameworks is mostly unnecessary and does more harm than good. They instead argue for using some simple patterns, directly at the API level. Not dissimilar to the old-school Gang of Four software engineering patterns.
Really like this post as guidance for how to actually build useful tools with LLMs. Keep it simple, stupid.
When an agent is given a task, it inevitably comes up with different plans on different tries due to the inherent nature of LLMs. Most companies want this step to be predictable, so they end up removing it from the system and doing it manually, thus turning it into a workflow automation rather than an agentic system. I think this is what people actually mean when they want to deploy agents in production. LLMs are great at automation*, not great at problem solving. Examples I have seen: customer support (you want predictability), lead mining, marketing copy generation, code flows and architecture, product specs generation, etc.
The next leap for AI systems is going to be whether they can solve challenging problems at companies: being the experts vs. just doing the task they are assigned. Those should really be called agents, not the current ones.
As I said, they already mention LangGraph in the article, so Anthropic's conclusions still hold (i.e. KISS).
But this thread is going in the wrong direction when talking about LangChain
I've built and/or worked on a few different LLM-based workflows, and LangChain definitely makes things worse in my opinion.
What it boils down to is that we are still coming to understand the right patterns for developing agents and agentic workflows. LangChain made choices about how to abstract things that are not general or universal enough to be useful.
I would just posit that they do make a distinction between workflows and agents
However, the post was posted here yesterday and didn't really get a lot of traction. I thought this was partially because of the term agentic, which the community seems a bit fatigued by. So I put it in quotes to highlight that Anthropic themselves deem it a little vague, and hopefully to spark more interest. I don't think it messes with their message too much?
Honestly it didn't matter anyway; without the second-chance pool this post would have been lost again (so thanks Daniel!)
It would decide what circumstances call for double-checking facts for accuracy, which would hopefully catch hallucinations. It would write its own acceptance criteria for its answers, etc.
It's not clear to me how to train each of the sub-models required, or how big (or small!) they need to be, or what architecture works best. But I think that complex architectures are going to win out over the "just scale up with more data and more compute" approach.
Now with 4o-mini I have a similar, if less obvious, problem.
Just writing this down convinced me that there are some ideas to try here - taking a 'report' of the thought process out of context and judging it there, or changing the temperature or even maybe doing cross-checking with a different model?
Janet Waldo was playing Corliss Archer on radio - and the quote the LLM found in Wikipedia was confirming it. But the question was about film - and the LLM cannot spot the gap in its reasoning - even if I try to warn it by telling it the report came from a junior researcher.
The questions then become:
1. When can you (i.e. a person who wants to build systems with them) trust them to make decisions on their own?
2. What type of trusted environments are we talking about? (Sandboxing?)
So, that all requires more thought -- perhaps by some folks who hang out at this site. :)
I suspect that someone will come up with a "real-world" application at a non-tech-first enterprise company and let us know.
You are building an AI system to respond to your email.
The first agent decides whether the new email should be responded to, yes or no.
If no, it can send it to another LLM call that decides to archive it or leave it in the inbox for the human.
If yes, it sends it to a classifier that decides what type of response is required.
Maybe there are some emails like for your work that require something brief like “congrats!” to all those new feature launch emails you get internally.
Or others that are inbound sales emails that need to go out to another system that fetches product related knowledge to craft a response with the right context. Followed by a checker call that makes sure the response follows brand guidelines.
The point is all of these steps are completely hypothetical but you can imagine how loosely providing some set of instructions and function calls and procedural limits can easily classify things and minimize error rate.
You can do this for any workflow by creatively combining different function calls, recursion, procedural limits, etc. And if you build multiple different decision trees/workflows, you can A/B test those and use LLM-as-a-judge to score the performance. Especially if you’re working on a task with lots of example outputs.
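To make the email example above a bit more concrete, here is a minimal sketch of the routing part in Python. Everything in it is an assumption for illustration: `llm` is just any function from prompt text to response text (swap in whatever client you use), and the prompts, labels, and action names are made up.

```python
from dataclasses import dataclass
from typing import Callable

# An "LLM call" here is any function from prompt text to response text.
LLM = Callable[[str], str]


@dataclass
class Email:
    sender: str
    subject: str
    body: str


def route_email(email: Email, llm: LLM) -> str:
    """Return the action for one email, using small single-purpose LLM calls."""
    text = f"Subject: {email.subject}\n\n{email.body}"

    # Step 1: yes/no gate. Anything other than a clean "yes" is treated as "no".
    should_reply = llm("Should this email get a reply? Answer yes or no.\n" + text)
    if should_reply.strip().lower() != "yes":
        # Step 2a: a second call decides archive vs. leave for the human.
        decision = llm("Archive this email or leave it in the inbox? "
                       "Answer archive or inbox.\n" + text)
        return "archive" if decision.strip().lower() == "archive" else "leave_in_inbox"

    # Step 2b: classify the kind of reply; unknown labels fall back to the human.
    label = llm("Classify the reply needed: brief or sales.\n" + text).strip().lower()
    if label == "brief":
        return "send_brief_reply"
    if label == "sales":
        return "draft_sales_reply_then_check_brand_guidelines"
    return "leave_in_inbox"
```

Note that every branch compares the model output against a small closed set and falls back to leaving the email for the human, which is one cheap way to apply a procedural limit.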
As for trusted environments, assume every single LLM call has been hijacked and don’t trust its input/output and you’ll be good. I put mine in their own cloudflare workers where they can’t do any damage beyond giving an odd response to the user.
How would you trust that the agent is following the criteria, and how sure are you that the criteria are specific enough? Say someone you just met told you they were going to send you something via email, but the agent misinterprets it due to missing context and decides to respond in a generic manner, leading to a misunderstanding.
> assume every single LLM call has been hijacked and don’t trust its input/output and you’ll be good.
Which is not new. But with formal languages, you have a more precise definition of what acceptable inputs are (the whole point of formalism is precise definitions). With LLM workflows, the whole environment should be assumed to be public information. And you should probably add fine print stating that the output does not commit you to anything.
Finer-grained control over the tools the LLM is supposed to use. The 'tool_choice' should allow giving a list of tools to choose from. The point is that the list of all available tools is needed to interpret the past tool calls, so you cannot use it to also limit the LLM's choice at a particular step. See also: https://zzbbyy.substack.com/p/two-roles-of-tool-schemas
Control over how many tool calls can go in one request. For stateful tools, multiple tool calls in one request lead to confusion.
By the way - is anyone working with stateful tools? Often they seem very natural and you would think that the LLM at training should encounter lots of stateful interactions and be skilled in using them. But there aren't many examples and the libraries are not really geared towards that.
I recently wrote[1] about the 4 main components of autonomous AI agents (Profile, Memory, Planning & Action) and all of that can still be accomplished with simple LLM calls, but there’s simply a lot more to think about than simple workflow orchestration if you are thinking of building production-ready autonomous agentic systems.
[1] https://melvintercan.com/p/anatomy-of-an-autonomous-ai-agent
I work on CAAs and document my journey on my substack (https://jdsmerau.substack.com)
I think these days the main value of the LLM "agent" frameworks is being able to trivially switch between model providers, though even that breaks down when you start to use more esoteric features that may not be implemented in cleanly overlapping ways
* A "network of agents" is a system of agents and tools
* That run and build up state (both "memory" and actual state via tool use)
* Which is then inspected when routing as a kind of "state machine".
* Routing should specify which agent (or agents, in parallel) to run next, via that state.
* Routing can also use other agents (routing agents) to figure out what to do next, instead of code.
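The bullets above can be sketched in a few lines of Python. To be clear, this is not AgentKit's API; it is a toy illustration with hypothetical names (`State`, `run_network`, `planner`, `coder`, `code_router`), and it runs one agent at a time rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class State:
    memory: list[str] = field(default_factory=list)        # record of what has run
    data: dict[str, object] = field(default_factory=dict)  # state built up via tool use


Agent = Callable[[State], None]            # an agent reads and mutates shared state
Router = Callable[[State], Optional[str]]  # next agent's name, or None to stop


def run_network(agents: dict[str, Agent], router: Router, state: State,
                max_steps: int = 10) -> State:
    """Run agents one at a time, asking the router what to do after each step."""
    for _ in range(max_steps):  # hard limit so a bad router cannot loop forever
        name = router(state)
        if name is None:
            break
        agents[name](state)
        state.memory.append(f"ran {name}")
    return state


# Hypothetical agents: real ones would call an LLM and tools.
def planner(state: State) -> None:
    state.data["plan"] = "edit file, then run tests"


def coder(state: State) -> None:
    state.data["patch"] = "diff --git ..."


def code_router(state: State) -> Optional[str]:
    # Plain code inspects the state, much like a state machine would.
    if "plan" not in state.data:
        return "planner"
    if "patch" not in state.data:
        return "coder"
    return None
```

A "routing agent" would simply be another `Router` whose body asks a model to pick the next name instead of using `if` statements.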
We're codifying this with durable workflows in a prototypical library — AgentKit: https://github.com/inngest/agent-kit/ (docs: https://agentkit.inngest.com/overview).
It took less than a day to get a network of agents to correctly fix swebench-lite examples. It's super early, but very fun. One of the cool things is that this uses Inngest under the hood, so you get all of the classic durable execution/step function/tracing/o11y for free, but it's just regular code that you write.
If runtime information is insufficient, we can use AI/ML models to fill that information. But deciding the next step could be done ahead of time assuming complete information.
Most AI agent examples short circuit these two steps. When faced with unstructured or insufficient information, the program asks the LLM/AI model to decide the next step. Instead, we could ask the LLM/AI model to structure/predict necessary information and use pre-defined rules to drive the process.
This approach will translate most [1] "Agent" examples into "Workflow" examples. The quotes here are meant to imply Anthropic's definition of these terms.
[1] I said "most" because there might be continuous world systems (such as a real-world simulacrum) that require a very large number of rules, where it is probably impractical to define each of them. I believe those systems are an exception, not the rule.
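Here is a rough sketch of that split, assuming a refund-handling process as the example. The model's only job is to turn unstructured text into a structured record (`Extractor` stands in for that LLM call); the next step is then chosen by ordinary pre-defined rules. The field names and thresholds are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RefundRequest:
    # Structured facts the model is asked to extract; None means "could not tell".
    order_total: Optional[float]
    days_since_purchase: Optional[int]


# Stands in for an LLM call that returns structured output.
Extractor = Callable[[str], RefundRequest]


def next_step(req: RefundRequest) -> str:
    """Pre-defined rules decide the next step; the model never chooses it."""
    if req.order_total is None or req.days_since_purchase is None:
        return "ask_customer_for_details"
    if req.days_since_purchase > 30:
        return "reject"
    if req.order_total > 500:
        return "escalate_to_human"
    return "auto_refund"


def handle(message: str, extract: Extractor) -> str:
    return next_step(extract(message))
```

Because `next_step` is plain code, it can be unit-tested and audited without involving the model at all.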
Here are some challenges I personally faced recently:
- Durable Execution Paradigm: You may need the system to operate in a "durable execution" fashion like Temporal, Hatchet, Inngest, and Windmill. Your processes need to run for months, be upgraded and restarted. Links below
- FSM vs. DAG: Sometimes, a Finite State Machine (FSM) is more appropriate than a Directed Acyclic Graph (DAG) for my use cases. FSMs support cyclic behavior, allowing for repeated states or loops (e.g., in marketing sequences). FSM done right is hard. If you need FSM, you can't use most tools without "magic" hacking
- Observability and Tracing - it takes time to set everything up nicely in Grafana (Alloy, Tempo, Loki, Prometheus) or whatever you prefer. Switching attention between multiple systems is not an option due to limited attention span and a "skills" issue. Most "out of the box" functionality in new agent frameworks quickly becomes a liability
- Token/Inference Economy - tracking token consumption and identifying edge cases with poor token management is a challenge, similar to Ethereum's gas consumption issues. Building a billing system based on actual consumption on top of Stripe was a challenge. This is even 10x harder ... at least for me ;)
- Context Switching - managing context switching is akin to handling concurrency and scheduling with async/await paradigms, which can become complex. Simple prompts are OK, but once you start juggling documents or screenshots or screen reading it's another game.
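On the FSM vs. DAG point, a small sketch may help show why the cycle matters. This is a hypothetical marketing-sequence machine (state and event names are made up): the `WAIT -> SEND_EMAIL` transition is a loop, which a DAG cannot express directly.

```python
from enum import Enum, auto


class S(Enum):
    SEND_EMAIL = auto()
    WAIT = auto()
    DONE = auto()


# (state, event) -> next state. WAIT -> SEND_EMAIL is the cycle.
TRANSITIONS = {
    (S.SEND_EMAIL, "sent"): S.WAIT,
    (S.WAIT, "no_reply"): S.SEND_EMAIL,
    (S.WAIT, "replied"): S.DONE,
    (S.WAIT, "unsubscribed"): S.DONE,
}


def step(state: S, event: str) -> S:
    """Apply one event; anything not in the table is rejected loudly."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None


def run(events: list[str], start: S = S.SEND_EMAIL) -> S:
    state = start
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(["sent", "no_reply", "sent", "replied"])` walks the loop once and ends in `S.DONE`. Keeping the table explicit also makes it easy to persist the current state between restarts.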
What I like about all of the above is that it's nothing new: all the design patterns and architectures have been known for a while.
It's just hard to see through the AI/ML buzzword storm ... but once you start looking at source code ... the fog of mind wars clears.
Durable Execution / Workflow Engines
- Temporal https://github.com/temporalio - https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
- Hatchet https://news.ycombinator.com/item?id=39643136
- Inngest https://news.ycombinator.com/item?id=36403014
- Windmill https://news.ycombinator.com/item?id=35920082
Any comments and links on the above challenges and solutions are greatly appreciated!
I think this is where durable execution shines. By ensuring every step in an async processing workflow is fault-tolerant and durable, even interruptions won't lose progress. For example, in a refund workflow, a durable system can resume exactly where it left off—no duplicate refunds, no lost state.
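For anyone who hasn't used one of these engines, the core idea can be sketched in plain Python. This is a toy, not how Temporal, Hatchet, Inngest, or Windmill implement it: `DurableRun` is a hypothetical class that records each named step's result in a JSON file, so a restarted run returns the stored result instead of executing the step again.

```python
import json
import os
from typing import Callable


class DurableRun:
    """Persist each step's result so a restarted run skips completed steps."""

    def __init__(self, path: str):
        self.path = path
        self.done: dict[str, object] = {}
        if os.path.exists(path):
            with open(path) as f:
                self.done = json.load(f)

    def step(self, name: str, fn: Callable[[], object]) -> object:
        if name in self.done:   # already completed before a crash or restart
            return self.done[name]
        result = fn()           # must return JSON-serializable data
        self.done[name] = result
        # Write to a temp file and rename, so a crash mid-write
        # cannot leave a half-written log behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.done, f)
        os.replace(tmp, self.path)
        return result
```

One caveat with this sketch: if the process dies after `fn()` has had its side effect but before the result is written, the step will run again on restart. That gap is why the side effect itself (e.g. the refund call) usually also needs an idempotency key.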
It’s as if AI took over the writing-the-program part of software engineering, but sort of left all the rest.
Agents are Interfaces, Not Implementations
The current zeitgeist seems to think of agents as passthrough agents: e.g. a light wrapper around a core that's almost 100% an LLM.
The most effective agents I've seen, and have built, are largely traditional software engineering with a sprinkling of LLM calls for "LLM hard" problems. LLM hard problems are problems that can ONLY be solved by application of an LLM (creative writing, text synthesis, intelligent decision making). Leave all the problems that are amenable to decades of software engineering best practice to good old deterministic code.
I've been calling systems like this "Transitional Software Design." That is, they're mostly a traditional software application under the hood (deterministic, well structured code, separation of concerns) with judicious use of LLMs where required.
Ultimately, users care about what the agent does, not how it does it.
The biggest differentiator I've seen between agents that work and get adoption, and those that are eternally in a demo phase, is related to the cardinality of the state space the agent is operating in. Too many folks try to "boil the ocean" and implement a general-purpose capability: e.g. generating Python code to do something, or synthesizing SQL based on natural language.
The projects I've seen that work really focus on reducing the state space of agent decision making down to the smallest possible set that delivers user value.
e.g. Rather than generating arbitrary SQL, work out a set of ~20 SQL templates that are hyper-specific to the business problem you're solving. Parameterize them with the options for select, filter, group by, order by, and the subset of aggregate operations that are relevant. Then let the agent choose the right template + parameters from a relatively small finite set of options.
^^^ the delta in agent quality between "boiling the ocean" vs "agent's free choice over a small state space" is night and day. It lets you deploy early, deliver value, and start getting user feedback.
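As a rough illustration of the template idea, here is a sketch with two made-up templates over a hypothetical `orders` table. The agent's output is assumed to be a small dict naming a template and its parameters; every parameter must come from a closed set, so no free-form model text ever ends up in the SQL.

```python
# Hypothetical templates; a real system would have ~20 tied to the business problem.
TEMPLATES = {
    "revenue_by": (
        "SELECT {group_by}, SUM(amount) AS revenue FROM orders "
        "GROUP BY {group_by} ORDER BY revenue {order}"
    ),
    "top_customers": (
        "SELECT customer_id, {agg}(amount) AS value FROM orders "
        "GROUP BY customer_id ORDER BY value {order}"
    ),
}

# Closed sets of allowed values per parameter.
ALLOWED = {
    "group_by": {"region", "product", "month"},
    "order": {"ASC", "DESC"},
    "agg": {"SUM", "AVG", "COUNT"},
}


def render(choice: dict) -> str:
    """Validate the agent's choice (template name + params) and render the SQL."""
    template = TEMPLATES[choice["template"]]  # KeyError on an unknown template
    params = choice.get("params", {})
    for key, value in params.items():
        if value not in ALLOWED.get(key, set()):
            raise ValueError(f"parameter {key}={value!r} is not allowed")
    return template.format(**params)          # KeyError if a parameter is missing
```

Usage would look like `render({"template": "revenue_by", "params": {"group_by": "region", "order": "DESC"}})`. Expanding the state space later just means adding templates or allowed values, which fits the "gradually expand" step.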
Building Transitional Software Systems:
1. Deeply understand the domain and CUJs,
2. Segment out the system into "problems that traditional software is good at solving" and "LLM-hard problems",
3. For the LLM hard problems, work out the smallest possible state space of decision making,
4. Build the system, and get users using it,
5. Gradually expand the state space as feedback flows in from users.
The smaller and more focused the context, the higher the consistency of output, and the lower the chance of jank.
Fundamentally no different than giving instructions to a junior dev. Be more specific -- point them to the right docs, distill the requirements, identify the relevant areas of the source -- to get good output.
My last attempt at a workflow of agents was at the 3.5 to 4 transition and OpenAI wasn't good enough at that point to produce consistently good output and was slow to boot.
My team has taken the stance that getting consistently good output from LLMs is really an ETL exercise: acquire, aggregate, and transform the minimum relevant data for the output to reach the desired level of quality and depth, and let the LLM do its thing.
The balance of traditional software components and LLM-driven components in a system is an interesting topic - I wonder how the capabilities of future generations of foundation models will change that?
Just that the pragmatic approach, today, given current LLM capabilities, is to minimize the surface area / state space that the LLM is actuating. And then gradually expand that until the whole system is just a passthrough. But starting with a passthrough kinda doesn't lead to great products in December 2024.
Divide and conquer me hearties.
That isn’t simple. There is a lot of nuance in tool definition.