AI Pioneer Shows The Power of AI AGENTS - "The Future Is Agentic"

Matthew Berman
Andrew Ng, Google Brain and Coursera founder, discusses agents' power and how to use them. Be sure...
Video Transcript:
Dr. Andrew Ng just did a talk at Sequoia, and it is all about agents. He is incredibly bullish on agents. He said things like GPT-3.5 powering agents can actually reason to the level of GPT-4, and a lot of other really interesting tidbits. So we're going to watch his talk together, and I'm going to walk you through, step by step, what he's saying and why it's so important. I am incredibly bullish on agents myself. That's why I make so many videos about them, and I truly believe the future of artificial intelligence is going to be agentic.

So first, who is Dr. Andrew Ng? He is a computer scientist. He was the co-founder and head of Google Brain, the former chief scientist of Baidu, and a leading mind in artificial intelligence. He went to UC Berkeley, MIT, and Carnegie Mellon, so smart, smart dude. And he co-founded this company, Coursera, where you can learn a ton about computer science, about math, a bunch of different topics, absolutely free. What he's doing is truly incredible, so when he talks about AI, you should listen.

So let's get to this talk. This is at Sequoia, and if you're not familiar with Sequoia, they are one of the most legendary Silicon Valley venture capital firms ever. Now here's an interesting stat about Sequoia that just shows how incredible they are at picking technological winners: their portfolio of companies represents more than 25% of today's total value of the NASDAQ. So of the total value of all the companies that are listed on the NASDAQ, 25% of that market capitalization is companies that are owned, have been owned, or were invested in by Sequoia. Incredible stat. Let's look at some of their companies: Reddit, Instacart, DoorDash, Airbnb, a little company called Apple, Block, Snowflake, Vanta, Zoom, Stripe, WhatsApp, Okta, Instagram. This list is absolutely absurd. All right, enough of the preface, let me get into the talk itself.

So, AI agents. You know, today the way most of us use large language models is like this, with a non-agentic workflow where you type a prompt and it generates an answer, and that's
a bit like if you ask a person to write an essay on a topic and say, please sit down at the keyboard and just type the essay from start to finish without ever using backspace. And despite how hard this is, LLMs do it remarkably well. In contrast, with an agentic workflow, this is what it may look like: have an LLM write an essay outline. Do you need to do any web research? If so, let's do that. Then write the first draft, then read your own first draft and think about what parts need revision, and then revise your draft, and you go on and on. So this workflow is much more iterative, where you may have the LLM do some thinking, then revise the article, then do some more thinking, and iterate this through a number of times.

So I want to pause it there and talk about this, because this is the best explanation for why agents are so powerful. I've heard a lot of people say, well, agents are just LLMs, right? And yeah, technically that's true, but the power of an agentic workflow is the fact that you can have multiple agents, all with different roles, different backgrounds, different personas, different tools, working together and iterating. That's the important word: iterating on a task. So in this example he said, okay, write an essay, and yeah, an LLM can do that, and usually it's pretty darn good. But now let's say you have one agent who is the writer, another agent who is the reviewer, another for the spell checker, another for the grammar checker, another for the fact checker, and they're all working together, and they iterate over and over again, passing the essay back and forth, making sure that it finally ends up being the best possible outcome. This is how humans work. Humans, as he said, do not just do everything in one take without thinking through and planning. We plan, we iterate, and then we find the best solution. So let's keep listening.

What not many people appreciate is this delivers remarkably better results. I've actually really surprised myself, working with these
agent workflows, how well they work. Let's do one case study. My team analyzed some data using a coding benchmark called the HumanEval benchmark, released by OpenAI a few years ago. This has coding problems like: given a non-empty list of integers, return the sum of all the odd elements that are in even positions. And it turns out the answer is a code snippet like that. So today a lot of us will use zero-shot prompting, meaning we tell the AI to write the code and have it run on the first pass. Like, who codes like that? No human codes like that, just type out the code and run it. Maybe you do. I can't do that. So it turns out that if you use GPT-3.5 with zero-shot prompting, it gets it 48% right. GPT-4, way better, 67% right. But if you take an agentic workflow and wrap it around GPT-3.5, say, it actually does better than even GPT-4. And if you were to wrap this type of workflow around GPT-4, you know, it also does very well.

All right, let's pause here and think about what he just said. Over here we have the zero-shot, which basically means you're simply telling the large language model, do this thing: not giving it any example, not giving it any chance to think or to iterate, or any fancy prompting, just do this thing. And it got the HumanEval benchmark 48% correct. Then GPT-4, 67%, which is, you know, a huge improvement, and we're going to continue to see improvement when GPT-5 comes out and so on. However, look at this: GPT-3.5 wrapped in an agentic workflow. Any of these all perform better than the zero-shot GPT-4, using only GPT-3.5. And this LDB plus Reflexion, it's actually nearly 100%, it's over 95%. Then of course if we wrap GPT-4 in the agentic workflow, MetaGPT for example, we all know about it, it performs incredibly well across the board, and AgentCoder is kind of at the top here. So it's really just showing the power of agentic workflows.

And you notice that GPT-3.5 with an agentic workflow actually outperforms GPT-4, and I think this has significant consequences for how we all approach building applications. So agents is a term that has been tossed around a lot. There's a lot of consultant reports about agents, the future of AI, blah blah blah. I want to be a bit concrete and share with you the broad design patterns I'm seeing in agents. It's a very messy, chaotic space: tons of research, tons of open source, there's a lot going on. But I try to categorize a bit more concretely what's going on in agents. Reflection is a tool that I think many of us should just use. It just works. Tool use, I think, is more widely appreciated, but it actually works pretty well. I think of these as pretty robust technologies. When I...

All right, let's stop there and talk about what these things are. So reflection is as obvious as it sounds. You are literally saying to the large language model: reflect on the output you just gave me, find a way to improve it, then return another result, or just return the improvements. So very straightforward, and it seems so obvious, but this actually causes large language models to perform a lot better. And then we have tool use, and we learned all about tool use with projects like AutoGen and CrewAI. Tool use just means that you can give them tools to use. You can custom-code tools. It's like function calling. So you could say, okay, I want a web scraping tool, and I want an SEC lookup tool so you can get stock information about ticker symbols. You can even plug in complex math libraries to it. I mean, the possibilities are literally endless. So you can give a bunch of tools that the large language model didn't previously have. You just describe what the tool does, and the large language model can actually choose when to use the tool. It's really cool.

...use them, I can, you know, almost always get them to work well. Planning and multi-agent collaboration, I think, are more emerging. When I use them, sometimes my mind is blown by how
well they work, but at least at this moment in time I don't feel like I can always get them to work reliably. So let me walk through these four design patterns.

All right, so he's going to walk through it, but I just want to touch on what planning and multi-agent collaboration are. With planning, we're basically giving the large language model the ability to think more slowly, to plan steps. And that's usually, by the way, why in all of my LLM tests I say, explain your reasoning step by step, because that kind of forces them to plan and to think through each step, which usually produces better results. And then multi-agent collaboration: that is AutoGen and CrewAI. That is a very emergent technology. I am extremely bullish on it. It is sometimes difficult to get the agents to behave like you need them to, but with enough QA and enough testing and iteration you usually can, and the results are phenomenal. And not only do you get the benefit of having the large language model essentially reflect with different personalities or different roles, but you can actually have different models powering different agents, and so you're getting the benefit of the reflection based on the quality of each model. So you're basically getting really different opinions as these agents are working together. So let's keep listening.

And if some of you go back and use these yourselves, or ask your engineers to use these, I think you'll get a productivity boost quite quickly. So, reflection. Here's an example. Let's say I ask a system, please write code for me for a given task. Then we have a coder agent, just an LLM that you prompt to write code, to say, you know, def do_task, write a function like that. An example of self-reflection would be if you then prompt the LLM with something like this: here's code intended for a task, and just give it back the exact same code that it just generated, and then say, check the code carefully for correctness, style, efficiency, and give constructive criticism. Just write a prompt like that. It turns out the same LLM
that you prompted to write the code may be able to spot problems, like this bug in line five, and fix it by blah blah blah. And if you now take its own feedback and give it back to it and reprompt it, it may come up with a version two of the code that could well work better than the first version. Not guaranteed, but it works, you know, often enough for this to be worth trying for a lot of applications.

So what you usually see me doing in my LLM test videos is, for example, let's say I say, write the game Snake in Python, and it gives me the game Snake. That is zero-shot. I'm just saying write it all out in one go. Then I take it, I put it in my VS Code, I play it, I get the error or I look for any bugs, and then I paste that back in to the large language model to fix. Now that's essentially me acting as an agent, and what we can do is use an agent to automate me. So basically, look at the code, look for any potential errors, and even have agents that can run the code, get the error, and pass it back into the large language model. Now it's completely automated coding.

To foreshadow tool use: if you let it run unit tests, if it fails a unit test, then, why did you fail the unit test? Have that conversation, and it may be able to figure out, failed the unit test, so you should try changing something, and come up with V3. By the way, for those of you that want to learn more about these technologies, I'm very excited about them. For each of the four sections I have a little recommended reading section at the bottom that, you know, hopefully gives more references. And again, just to foreshadow multi-agent systems: I've described a single coder agent that you prompt to have it, you know, have this conversation with itself. One natural evolution of this idea is, instead of a single coder agent, you can have two agents, where one is a coder agent and the second is a critic agent. And these could be the same base LLM, but you prompt them in different ways, where you say to one, you're an expert coder, write code, and to the other one you say, you're an expert code
reviewer, review this code. And this type of workflow is actually pretty easy to implement. I think it's a very general-purpose technology. For a lot of workflows, this will give you a significant boost in the performance of LLMs.

The second design pattern is tool use. Many of you will already have seen, you know, LLM-based systems using tools. On the left is a screenshot from Copilot. On the right is something that I kind of extracted from GPT-4. But, you know, an LLM today, if you ask it what's the best coffee maker, can do web search. For some problems, LLMs will generate code and run code. And it turns out that there are a lot of different tools that many different people are using for analysis, for gathering information, for taking action, for personal productivity. It turns out a lot of the early work in tool use turned out to be in the computer vision community, because before, large language models, LLMs, you know, they couldn't do anything with images, so the only option was that the LLM generate a function call that could manipulate an image, like generate an image or do object detection or whatever. So if you actually look at the literature, it's been interesting how much of the work in tool use seems like it originated from vision, because LLMs were blind to images before, you know, GPT-4V and LLaVA and so on. So that's tool use.

All right, so tool use, incredibly, incredibly important, because you're basically giving the large language model code to use. It is hard-coded code, so you always know the result. It's not another large language model that might produce something a little different each time. This is hard-coded and is always going to produce the same output. So these tools are very valuable, and the cool thing about tools is we don't have to rewrite them, right? We don't have to write them from scratch. These are tools that programmers already use in their code. So whether it's external libraries or API calls, all of these things can now be used by large language
models, and that is really exciting. We're not going to have to rewrite all of this tooling.

And then planning. You know, for those of you that have not yet played a lot with planning algorithms, I feel like a lot of people talk about the ChatGPT moment, where you're like, wow, never seen anything like this. I think if you've not used planning algorithms, many people will have a kind of AI agent wow moment: I couldn't imagine the AI agent doing this. I've run live demos where something failed and the AI agent rerouted around the failure. I've actually had quite a few of those moments, like, wow, I can't believe my AI system just did that autonomously. But one example that I adapted from the HuggingGPT paper: you know, you say, please generate an image where a girl...

And by the way, I made a video about HuggingGPT. It is an amazing paper. I'll link that in the description below.

...is reading a book, and her pose is the same as the boy in the image example.jpg, and please describe the new image with your voice. So given an example like this, today we have AI agents who can kind of decide: the first thing I need to do is determine the pose of the boy, then, you know, find the right model, maybe on Hugging Face, to extract the pose. Then next, I need to find a pose-to-image model to synthesize a picture of a girl following the instructions, then use image-to-text, and then finally use text-to-speech. And today we actually have agents that, I don't want to say they work reliably, you know, they're kind of finicky, they don't always work, but when it works it's actually pretty amazing. But with agentic loops, sometimes you can recover from earlier failures as well.

So yeah, and that's a really important point. Agents are a little bit finicky, but since you can iterate, and the agents can usually recover from their issues, that makes them a lot more powerful. And as we continue to evolve agents, as we get better agentic models, better tooling, better frameworks like CrewAI and AutoGen, all of these kind of finicky aspects of agents
are going to start to get reduced tremendously.

I find myself already using research agents in some of my work, where I want a piece of research but I don't feel like, you know, Googling myself and spending a long time. I just send it to the research agent, come back in a few minutes, and see what it's come up with. And it sometimes works, sometimes doesn't, right? But that's already a part of my personal workflow.

The final design pattern: multi-agent collaboration. This is one of those funny things, but it works much better than you might think. On the left is a screenshot from a paper called ChatDev...

I made a video about this. It'll be in the description below as well.

...which is completely open, which is actually open source. Many of you saw the, you know, flashy social media announcements of the demo of Devin. ChatDev is open source, it runs on my laptop. And what ChatDev does is an example of a multi-agent system, where you prompt one LLM to sometimes act like the CEO of a software engineering company, sometimes act like a designer, sometimes a product manager, sometimes a tester. And this flock of agents that you build by prompting an LLM to tell them, you're now CEO, you're now software engineer, they collaborate, have an extended conversation, so that if you tell it, please develop a game, develop a Gomoku game, they'll actually spend, you know, a few minutes writing code, testing it, iterating, and then generate, like, surprisingly complex programs. Doesn't always work. I've used it. Sometimes it doesn't work, sometimes it's amazing. But this technology is really getting better. And just one more design pattern: it turns out that multi-agent debate, where you have different agents, you know, for example it could be having ChatGPT and Gemini debate each other, actually results in better performance as well.

All right, so he said the important part right there. When you have different agents and each of them is powered by different models, maybe even fine-tuned models, fine-tuned specifically for
their task and their role, you get really good performance, and that is exactly what a project like CrewAI or AutoGen is made for.

So having multiple simulated AI agents work together has been a powerful design pattern as well. So just to summarize, I think these are the patterns I've seen, and I think that if we were to use these patterns, you know, in our work, a lot of us can get a productivity boost quite quickly. And I think that agentic reasoning design patterns are going to be important. This is my small slide. I expect that the set of tasks AI can do will expand dramatically this year because of agentic workflows. And one thing that it's actually difficult for people to get used to is, when we prompt an LLM, we want a response right away. In fact, a decade ago, when I was, you know, having discussions at Google on what we called a big box search, where you type in a long prompt, one of the reasons, you know, I failed to push successfully for that was because when you do a web search, you want a response back in half a second, right? That's just human nature. We like that instant feedback. But for a lot of the agent workflows, I think we'll need to learn to delegate the task to an AI agent and patiently wait minutes, maybe even hours, for a response. But just like us, I've seen a lot of novice managers delegate something to someone and then check in five minutes later, right? And that's not productive. I think it'll be difficult, but we need to do that with some of our AI agents as well.

All right, so this is actually a point on which I want to pose a different way of thinking about it. Think about Groq, G-R-O-Q. You get 500, 700, 850 tokens per second with Groq, with their architecture. And all of a sudden the agents, which, you know, you usually expect to take a few minutes to do a semi-complex task, all the way up to 10, 15, 20 minutes depending on what the task is: a lot of the time in that task completion is the inference running. That is
assuming you're getting, you know, 10, 15, 20 tokens per second with OpenAI. But if you're able to get 800 tokens per second, it's essentially instant. And a lot of people, when they first saw Groq, thought, well, what's the point of 800 tokens per second, because humans can't read that fast? This is the best use case for that. Agents using hyper inference speed and reading each other's responses is the best way to leverage that really fast inference speed. Humans don't actually need to read it. So this is a perfect example. If all of a sudden that part of your agent workflow is extremely fast, and then let's say we get an embeddings model to be that fast, all of a sudden the slowest part of the entire agent workflow is going to be searching the web or hitting a third-party API. It's no longer going to be the inference and the embeddings, and that is really exciting. Let's keep watching to the end.

And then one other important trend: fast token generation is important, because with these agentic workflows we're iterating over and over, so the LLM is generating tokens for the LLM to read. And I think that generating more tokens really quickly from even a slightly lower-quality LLM might give good results compared to slower tokens from a better LLM. Maybe it's a little bit controversial, because it may let you go around this loop a lot more times, kind of like the results I showed with GPT-3.5 and an agent architecture on the first slide. And candidly, I'm really looking forward to Claude 5 and Claude 4 and GPT-5 and Gemini 2.
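
Editor's sketch: the reflection pattern described in the talk (write code, run unit tests, feed the failures back in, and try again) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the setup used in the talk. `fake_model` is a hypothetical stand-in for a real LLM call, the names `reflect_loop` and `run_tests` are made up for illustration, and the test cases are hand-written checks for the "sum of odd elements in even positions" problem mentioned in the case study.

```python
from typing import Callable, List, Tuple

TASK = ("Given a non-empty list of integers, return the sum of all "
        "the odd elements that are in even positions.")

# Hand-written checks (input list, expected sum); an assumption for this sketch.
TESTS: List[Tuple[List[int], int]] = [
    ([5, 8, 7, 1], 12),
    ([3, 3, 3, 3, 3], 9),
    ([30, 13, 24, 321], 0),
]

# Two canned "model outputs": a first draft with a bug (odd positions
# instead of even positions) and a corrected second draft.
BUGGY = (
    "def solution(lst):\n"
    "    return sum(x for i, x in enumerate(lst) if i % 2 == 1 and x % 2 == 1)\n"
)
FIXED = (
    "def solution(lst):\n"
    "    return sum(x for i, x in enumerate(lst) if i % 2 == 0 and x % 2 == 1)\n"
)


def fake_model(task: str, feedback: str) -> str:
    """Hypothetical stand-in for an LLM: buggy first, fixed once it gets feedback."""
    return FIXED if feedback else BUGGY


def run_tests(code: str, tests: List[Tuple[List[int], int]]) -> List[str]:
    """Run candidate code against the tests; return failure messages (empty if all pass)."""
    namespace: dict = {}
    try:
        # exec is only acceptable here because the code is our own canned
        # string; real model output should run in a sandbox.
        exec(code, namespace)
        solution = namespace["solution"]
    except Exception as exc:
        return [f"code did not load: {exc!r}"]
    failures = []
    for args, expected in tests:
        try:
            got = solution(list(args))
        except Exception as exc:
            failures.append(f"solution({args}) raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"solution({args}) returned {got}, expected {expected}")
    return failures


def reflect_loop(
    task: str,
    generate: Callable[[str, str], str],
    tests: List[Tuple[List[int], int]],
    max_rounds: int = 3,
) -> Tuple[str, int, bool]:
    """Generate code, test it, and feed failures back until tests pass or rounds run out."""
    feedback = ""
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = generate(task, feedback)
        failures = run_tests(code, tests)
        if not failures:
            return code, round_no, True
        feedback = "\n".join(failures)  # becomes part of the next prompt
    return code, max_rounds, False


if __name__ == "__main__":
    final_code, rounds, passed = reflect_loop(TASK, fake_model, TESTS)
    print(rounds, passed)
```

With the canned model above, the first draft fails all three checks and the second draft passes, so the loop stops after two rounds. Swapping `fake_model` for a real model call, and using a second, differently prompted model to write the feedback, would give the coder and critic arrangement described in the talk.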