Anthropic Revealed Secrets to Building Powerful Agents

5.76k views · 3,436 words
Matthew Berman
What makes for a good AI agent? Watch to find out! Try Vultr yourself when you visit https://getvul...
Video Transcript:
I'm going to tell you how to build effective agents. Anthropic, the company behind the Claude family of models, just dropped a bunch of information about how to build effective agents, and I read through it and it's actually really good. We're going to go over it together, and I'm also going to give my thoughts after building a bunch of agents myself. So let's get into it. This video is brought to you by Vultr, the easiest way to power your generative AI startup with the latest Nvidia chips. Check out Vultr; I'll drop a link in the description below.
So this is the blog post; it came out just about a week ago: Building Effective Agents. Now, the first thing it starts out with is saying that you don't need complex frameworks to build agents, and that's true. Look at custom GPTs from ChatGPT: those are effectively agents. You choose a personality, you choose a role, you give it tools, you give it memory, and that constitutes an agent in my mind. But that's the most basic form of an agent. I personally use CrewAI; I'm an investor in them, so of course I use them, and I also think they're the best agentic framework. So if you need anything more sophisticated than just defining a single agent, agentic frameworks are really, really powerful, and these frameworks are getting more powerful by the day. Specifically, what they say here is: "Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns." Now, as these agentic frameworks mature, you're going to want to choose one of them. You don't want to reinvent the wheel each time; you don't want to try to figure out what the best patterns are each time. That's the point of a framework. That is why we've had frameworks for code forever.

Next, they go over what an agent is, and a lot of people have different definitions. I personally think an agent is essentially the core LLM, the intelligence, wrapped with memory, tools, and the ability to collaborate with other agents. Anthropic says: "Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents. Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."

Now, the best agentic frameworks blur the lines between workflows and agents. There are certain parts of agentic systems where you want more structure and want to use predefined code to help the agents along, and then there are other parts where you just want them to think, create, and work with other agents to come up with the best solution. And I really think, as I said, the best agentic frameworks blur this line, or make it easy to use both in a single use case.
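If it helps to see that distinction in code, here is a minimal sketch (my own illustration, not code from the post), written against a hypothetical llm(prompt) -> str callable: the workflow's path is fixed in code, while the agent decides its own next step each turn.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, text out

def workflow(task: str, llm: LLM) -> str:
    """Workflow: the code path is predefined; the model only fills in steps."""
    plan = llm(f"Outline a plan for: {task}")
    draft = llm(f"Execute this plan:\n{plan}")
    return llm(f"Review and finalize:\n{draft}")

def agent(task: str, llm: LLM, max_turns: int = 10) -> str:
    """Agent: the model directs its own process, deciding each next step."""
    history = task
    for _ in range(max_turns):
        step = llm(f"{history}\nState your next action, or 'DONE: <answer>'.")
        if step.strip().startswith("DONE:"):
            return step.strip().removeprefix("DONE:").strip()
        history += "\n" + step
    return history  # safety valve: never loop forever
```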
Next, they go over when to use and when not to use agents, and I could not agree more with this first sentence: "When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed." So important. That not only applies to agents, and not only to code; that is a life lesson. When building any system, start with the simplest implementation possible and only add complexity when necessary. "Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense." So the more sophisticated and complex your agent usage is, the more tokens it's going to use and the more time it's going to take. Now, solving that latency and cost is also something you can work on within your agentic system. This is why I'm also an investor in Groq: they have really cheap, insanely fast inference, so all of a sudden, when you plug that into an agentic system, the latency and cost part of the equation becomes less critical. They double down: "Workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale."

When and how to use frameworks: they talk about different frameworks, LangGraph from LangChain, the AI agent framework from Bedrock, Rivet, Vellum. Now, I don't know why they didn't mention CrewAI, because CrewAI is basically the biggest framework out there, but fine.
They also talk about why frameworks should be used: they offer a layer of abstraction, they come with a bunch of built-in tools, and they're basically a predefined golden path. You don't have to think about the best practices as much. Of course, you have to learn the framework, but you just don't have to figure out a lot of the ancillary issues as you go. Now, of course, there are downsides to using agentic frameworks: they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug, and they also make it tempting to add more complexity when it's not necessary, just because it's so easy to do.

So we're going to go over some examples of what an agentic system looks like, starting from something pretty simple. Anthropic specifically says that when a problem is this simple, their base models have all the necessary functionality to solve it without needing any kind of additional framework. Here's what we're seeing: you have the input, that's the prompt; then you have the LLM, which, through the Claude family of models, has the ability to get search results (so, retrieval), the ability to use tools, and memory; and then you have the output. So after it decides what it needs to use, it gives you the output. Now, these base models are getting better and better, and what that really means is the model providers are baking more and more agentic functionality directly into the base model, which is good.
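Here's roughly what that augmented-LLM building block looks like with the Anthropic Python SDK. This is a sketch: the tool name, description, and schema are illustrative choices of mine, not from the post, and the model alias may differ in your account.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# One illustrative tool definition (name and schema are mine, not Anthropic's).
tools = [{
    "name": "search_docs",
    "description": "Search internal documents and return matching passages.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "Search terms"}},
        "required": ["query"],
    },
}]

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # any tool-capable Claude model
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What do our guidelines say about agents?"}],
)

# If the model decided it needs the tool, the response carries a tool_use block;
# you execute the tool yourself and send the result back in a follow-up message.
for block in response.content:
    if block.type == "tool_use":
        print("Claude requested:", block.name, block.input)
```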
Now, I didn't cover this much, but there was a release by Anthropic (and by the way, Anthropic is completely on fire lately) called the Model Context Protocol. It's a framework for allowing LLMs to interact with third-party tools, essentially a definition for how to do so. Here, they describe it as allowing developers to integrate "with a growing ecosystem of third-party tools with a simple client implementation."

Let's look at a workflow now, and this is a workflow for prompt chaining. Let's read a little bit about it: "Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks on any intermediate steps to ensure that the process is still on track." So why would you need this? It sounds cool, but why would you need it? Well, the best way to explain it is when you have a situation that is complex and can be broken down into really modular pieces. If you try to get a model to accomplish a multistep task all in one go, you're basically not going to get as high quality, so the tradeoff here is to do each step independently, feed it to the next prompt, and then allow it to build off of that. As it says here, you're trading latency for quality. Here's the example they give: generate marketing copy, then translate it into a different language. These are two obviously different steps. First you would generate the marketing copy, then you would pass that to another prompt and allow it to do the translation separately, so it's not trying to do both of those tasks in a single go.
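A minimal sketch of that exact chain, assuming a generic llm(prompt) -> str callable rather than any particular SDK; the length check stands in for the "programmatic checks" on intermediate steps.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, text out

def marketing_chain(product: str, language: str, llm: LLM) -> str:
    # Step 1: generate the copy on its own, with the model doing one job.
    copy = llm(f"Write three sentences of marketing copy for: {product}")

    # Programmatic gate between steps: a cheap deterministic check that
    # step 1 is on track before we spend the second call.
    if len(copy.split()) < 10:
        raise ValueError(f"Copy too short to be usable: {copy!r}")

    # Step 2: translate in a separate call that builds on step 1's output.
    return llm(f"Translate this marketing copy into {language}:\n{copy}")
```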
Another workflow example is routing, and routing is incredibly powerful. You can have very specialized agents waiting to accomplish their tasks, and they can be very different in what they're able to accomplish. The cool thing is you can send a prompt and have a router decide which agent is most appropriate for the task: what kind of tools does it require, what kind of expertise does it require, what kind of model does it require? Of course, you can have different models for different agents. As it says here, "Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model or algorithm."

Now, one example of routing they give here I think is absolutely brilliant, so much so that I invested in a company that does exactly this. They say: "Routing easy, common questions to smaller models like Claude 3.5 Haiku and hard, unusual questions to more capable models like Claude 3.5 Sonnet to optimize cost and speed." So if you want to give the right prompt to the right model at the right time, routing is a good way to do that. The company I'm referring to is Not Diamond: they basically take your prompt, decide which of dozens of different models is most appropriate based on cost, latency, and quality, and then tell you which one to use. Big cost savings, big latency improvement, and overall quality improvement as well. But of course, you can create a simplified version of a routing algorithm yourself, just using a model.
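A sketch of that simplified router, with the same generic llm callable; the easy/hard labels follow their Haiku/Sonnet example, but the prompt wording and fallback behavior are my own choices.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, text out

def route(question: str, classifier: LLM, models: dict[str, LLM]) -> str:
    """Send easy questions to a cheap model and hard ones to a capable one.

    `models` maps labels to callables, e.g. {"easy": haiku, "hard": sonnet}.
    """
    label = classifier(
        f"Classify this question as exactly 'easy' or 'hard':\n{question}"
    ).strip().lower()
    if label not in models:
        label = "hard"  # if the classifier is unclear, pay for quality
    return models[label](question)
```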
Another workflow is parallelization. If the order of operations doesn't matter at a certain step, you can have multiple agents working in parallel to decrease the latency of completing the task. And I hadn't thought of this, but they describe two different variations of parallelization. First, sectioning: breaking a task into independent subtasks run in parallel. Simple. But then they also have voting: running the same task multiple times to get diverse outputs. That's essentially chain-of-thought reasoning, and it's a very basic explanation of how the thinking models work: you come up with a bunch of different variations, figure out which one is best, and then continue from there. So again, you can create this yourself. These are all very simple patterns in theory; of course they get more complex as you actually build them and productize them, but in theory they're pretty simple.

So when would you use parallelization as a workflow? They give examples for each type. First, for sectioning: "implementing guardrails, where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response." Now, if you remember the LLM jailbreaking game video that I made, where the game was that there was a crypto wallet, and if you convinced the LLM to send money out of the wallet, you would win whatever was in it (really cool video, by the way; I'll drop it in the description below), they did not do that: they had one model handling all of it. But as I suggested in the video, if they had had another model double-checking everything, one model to actually handle the sending and receiving of crypto and another model to be the guardrails and make sure it's not sending crypto, it would have been much more powerful.

Next in sectioning is "automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model's performance on a given prompt." Very straightforward: you have one model generating the response and another model evaluating it. And then for voting: "reviewing a piece of code for vulnerabilities." When you have one model creating the code and another model evaluating that code, you get much better quality code, and this is something we've all seen. The first generation of code from a model is sometimes okay and sometimes not great, but if you run the code, give the model the errors, and allow it to fix them, the code gets better much more quickly. What I've also noticed about models in general is that they're actually much better at evaluating things than generating things. That's not a hard-and-fast rule whatsoever, but generally what I've seen is that models are much more accurate at evaluating something than at generating that thing, and that of course includes evaluating whether a given piece of content is appropriate.
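Both variations in one sketch, again assuming a generic llm callable. LLM calls are I/O-bound, so a thread pool is enough to overlap them; note the voting version takes a majority over exact strings, which only works when answers are short and canonical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, text out

def sectioning(subtasks: list[str], llm: LLM) -> list[str]:
    """Run independent subtasks in parallel and return all outputs."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(llm, subtasks))

def vote(task: str, llm: LLM, n: int = 5) -> str:
    """Run the same task n times and keep the most common answer."""
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(llm, [task] * n))
    # Majority over exact strings; real use would normalize answers first.
    return Counter(answers).most_common(1)[0][0]
```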
Then another workflow: orchestrator-workers. This is something I've used quite a bit with CrewAI; all of these patterns are available in CrewAI, of course, and in most other agentic frameworks you might want to use. What I really like about the orchestrator pattern is that the orchestrator is able to get a result from an LLM, decide what to do with that result, and then potentially send it back for another iteration or send it off to another agent to complete the next step. In their definition: "In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results."

This is probably the workflow I use most often, and I'll tell you specifically what I built with it. I uploaded a PDF and asked an agent to create a bunch of truthful questions and answers, and then I had a reviewer agent make sure all of those questions and answers were accurate. What I found is that if I created 30 questions and answers, two of them might be hallucinated, meaning they either didn't exist in the source or were just wrong. So I had another agent check the evaluations, then I had the orchestrator agent send it back to the question-and-answer generator agent to generate two more, and then send those back to the orchestrator. This all happened very seamlessly; it was really cool.

So when do you use this workflow? They give two examples: "coding products that make complex changes to multiple files each time," and "search tasks that involve gathering and analyzing information from multiple sources for possibly relevant information." That second one is obviously the kind of thing I built.
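A minimal sketch of orchestrator-workers with the same generic llm callable; the splitting and synthesis prompts are my own illustrative wording, not Anthropic's.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, text out

def orchestrate(task: str, orchestrator: LLM, worker: LLM) -> str:
    # The orchestrator decides the subtasks at runtime; they are not
    # hardcoded, which is what separates this from plain sectioning.
    plan = orchestrator(f"Break this task into numbered, independent subtasks:\n{task}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # Delegate each subtask to a worker call (these could also run in parallel).
    results = [worker(f"Complete this subtask:\n{s}") for s in subtasks]

    # Synthesize the worker outputs into a single answer.
    joined = "\n---\n".join(results)
    return orchestrator(f"Combine these results into a final answer:\n{joined}")
```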
Then another workflow: evaluator-optimizer, again another super common pattern for AI workflows. "One LLM call generates a response while another provides evaluation and feedback in a loop." As we can see in this diagram, we have the prompt, the generator LLM call generates a solution, the evaluator evaluates it, and it either accepts it and returns it, or rejects it and says to generate another one. This is exactly what I did with the questions and answers, except it used evaluator-optimizer and the orchestrator together, so it was both of those patterns combined.

A couple of examples of where this evaluator pattern is useful: when you have clear evaluation criteria, and when iterative refinement provides measurable value. The first example is "literary translation, where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques." By the way, one of the most powerful ideas (I've said this already and I'm going to say it again), one of the most powerful ideas in all of AI right now is the evaluation pattern: the idea that the first generation of a response to a prompt is usually not going to be the best. That is literally what a thinking model does: it generates a bunch of different outputs, votes on the best one, then iterates on it, in a really long generation-evaluation-iteration cycle. The second example is "complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information," where the evaluator decides whether further searches are warranted.
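A sketch of that loop, with the same generic llm callable; the ACCEPT/REJECT protocol is my own convention for parsing the evaluator's verdict, not something from the post.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, text out

def refine(task: str, generator: LLM, evaluator: LLM, max_rounds: int = 3) -> str:
    """Generate, evaluate, and revise in a loop until the evaluator accepts."""
    draft = generator(task)
    for _ in range(max_rounds):
        verdict = evaluator(
            f"Task: {task}\nDraft:\n{draft}\n"
            "Reply exactly 'ACCEPT' or 'REJECT: <specific feedback>'."
        ).strip()
        if verdict.startswith("ACCEPT"):
            return draft
        feedback = verdict.removeprefix("REJECT:").strip()
        draft = generator(
            f"{task}\n\nRevise the draft below using the feedback.\n"
            f"Feedback: {feedback}\nDraft:\n{draft}"
        )
    return draft  # out of rounds; return the latest attempt
```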
Next, Anthropic goes on to describe more about agents, and I want to pull out a couple of key sentences and ideas. First, let me show you this line: "Agents begin their work with either a command from, or interactive discussion with, the human user." Now, that might not always be the case. I guess technically an agent will always have to start with a user, because even if you set it to be completely autonomous, you're still the one setting it to do so. But the way they're describing it, you're either saying, okay, go do this thing, or programmatically kicking off the agent to go do a thing.

They also talk a little bit about human-in-the-loop, which is another powerful idea for agentic frameworks: at which point is it critical that a human reviews the output or makes the decision, and are the agents capable of including a human in the loop at the right time? "Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgment. Agents can then pause for human feedback at checkpoints or when encountering blockers. The task typically terminates upon completion."

So when do we use agents? By the way, everything we've talked about previously has been workflows; now we're talking about agents. Agents can be used for open-ended problems, where it's difficult or impossible to predict the number of steps and where you can't hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. "The following examples are from our own implementations": a coding agent that resolves SWE-bench tasks, which involve edits to many files based on a task description, or their computer use reference implementation, where Claude uses a computer to accomplish tasks. By the way, if you haven't seen the interview with the SWE-bench team, it's really interesting; I'll drop it down below.

Here's the high-level flow of a coding agent. We have a query from the human: basically, here's what I want you to do. Then the interface, where the agent is clarifying and refining what the human wants it to do, basically defining it much more thoroughly. Then you send all the context to an LLM, and within the environment, the coding environment, you can do a bunch of things; these are all essentially tools: search files, return paths, write code, run tests and return the status and test results. And then we complete and display the results to the human.
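Here's that flow compressed into a sketch. The tool names mirror the diagram (search files, write code, run tests), but the stub implementations and the TOOL/DONE protocol are mine, not Anthropic's reference implementation.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in: prompt in, text out

# Stub tools mirroring the diagram; real versions would touch the repo.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_files": lambda q: f"(paths matching {q!r})",
    "write_code": lambda patch: "(patch applied)",
    "run_tests": lambda _: "(status + test results)",
}

def coding_agent(task: str, llm: LLM, max_turns: int = 20) -> str:
    # Clarify and refine the human's query into a precise spec up front.
    transcript = llm(f"Restate this task as a precise spec:\n{task}")
    # Then let the model drive the tool loop until it declares completion.
    for _ in range(max_turns):
        step = llm(
            f"Progress so far:\n{transcript}\n"
            f"Tools: {list(TOOLS)}. Reply 'TOOL <name>: <input>' or 'DONE: <summary>'."
        ).strip()
        if step.startswith("DONE:"):
            return step.removeprefix("DONE:").strip()  # display to the human
        name, _, arg = step.removeprefix("TOOL ").partition(":")
        result = TOOLS.get(name.strip(), lambda s: "(unknown tool)")(arg.strip())
        transcript += f"\n{step}\n-> {result}"
    return transcript  # turn budget exhausted; surface what we have
```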
Now, here's the important part, and something I already touched on: "These building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases. The key to success, as with any LLM feature, is measuring performance and iterating on implementations." That is probably the most important thing in this entire document. What I have found more than anything, especially because everything is so early right now, is that you just have to test it, test it, test it, test it. Use observability tools, use agentic frameworks when you can and when it's necessary, and just test a lot of different things. Run a bunch of different tests. Benchmark it. A lot of these agentic frameworks, including CrewAI, include benchmarking as part of their core functionality. So definitely test everything you can, and you're going to discover different patterns that work well for you and for the tasks you need accomplished.

I want to say thank you to Erik Schluntz and Barry Zhang, who wrote this; it's a great starter article for how to think about agents. I want to do more educational material around agents, so let me know if you want to see that in the comments down below. If you enjoyed this video, please consider giving a like and subscribe, and I'll see you in the next one.