Okay, cool. Hi, my name is Shunyu, and I'm super glad to be here to talk to you about LLM agents: a brief history and overview. Today's plan is very straightforward. I want to talk about three things: first, what is an LLM agent to start with; second, a brief history of LLM agents, both in the context of LLMs and in the context of agents; and lastly, I want to share some ideas on future directions for agents. As you know, this field is a moving piece, and it's very big and messy, so it's impossible to cover everything in agents. I'll just try to do the best I can, and you can see there's a QR code: you can scan it and give me feedback, and I can improve the talk accordingly. Okay, so let's get started. First, what is an LLM agent? Does anyone think they know the answer? If so, raise a hand: do you have a definition for what an LLM agent is? Okay, there are maybe three people.
So that means this field is really a moving piece. I think if we want to define what an LLM agent is, we first need to define the two components: what is an LLM, and what is an agent. Does everyone know what an LLM is? Okay, so really what's left is to define what an agent is. If you search Google Images, this is "agent", right? But in the context of AI, we know it's a notoriously broad term: it can refer to a lot of different things, from autonomous cars to Go to playing video games to chatbots. So first, what exactly is an agent? My definition is that it's an intelligent system that can interact with some environment, and depending on the environment you can have different agents: you can have physical environments, such as robots, autonomous cars, and so on, and you can have agents that interact with digital environments, such as video games or iPhones. And if you count humans as an environment, then a chatbot is also some kind of agent. If you want to define "agent", you really need to define what "intelligent" means and what the environment is. What's really interesting is that throughout the history of AI, the definition of "intelligent" often changes across time: 60 years ago, if you had a very basic chatbot using three lines of rules, it could be seen as intelligent, but right now even ChatGPT is not surprising anymore. So I think a good question for you all is: how do you even define intelligence? Okay, so let's say we have some definition of agent.
Then what is an LLM agent? I really think there are three categories, or three concepts. The first level is: what is a text agent? A text agent is an agent interacting with an environment where both the action and the observation are in language. Obviously, you can have text agents that don't use LLMs, and in fact we have had text agents from the beginning of AI, several decades ago. The second level of definition is the LLM agent, which is a text agent that also uses an LLM to act. And the last level is what I call the reasoning agent, where the agent uses the LLM to reason in order to act. Right now you might be confused about the difference between the second level and the third level, which I will explain later.
Like I said, people have been developing text agents from the beginning of AI. For example, back in the 1960s there were already chatbots: ELIZA is one of the earliest, and the idea is really simple, you just have a bunch of rules. What's really interesting is that with a bunch of rules you can already make a chatbot that feels quite human: what it does is keep asking you questions or repeating what you said, and people found it very human. But obviously there are limitations to these kinds of rule-based agents. If you want to design rules, it's often very task-specific, and for each new domain you need to develop new rules. And lastly, those rules don't really work beyond a simple domain: suppose you write many rules to build a chatbot; you'd then need to write many new rules for a video game agent, and so on and so forth.
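To make the rule-based idea concrete, here is a minimal sketch of an ELIZA-style loop in Python. The patterns and response templates are made up for illustration; they are not ELIZA's actual script.

```python
import re

# A few hand-written pattern -> response rules, in the spirit of ELIZA.
RULES = [
    (r"i need (.*)", "Why do you need {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
    (r".*\bmother\b.*", "Tell me more about your family."),
]

def respond(user_input: str) -> str:
    text = user_input.strip().rstrip(".")
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text, re.IGNORECASE)
        if match:
            return template.format(*match.groups())
    # Default rule: turn the statement back into a question.
    return "Why do you say that?"

print(respond("I need a vacation"))  # -> "Why do you need a vacation?"
```

Every new domain means hand-writing a new rule table, which is exactly the limitation being described here.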
Before LLMs, there was another very popular paradigm, which was to use RL to build text agents. I'm sure everybody has seen video games; you can imagine text games where, instead of pixels and a keyboard, you use text as observation and action. You similarly have rewards, so you can similarly use reinforcement learning to optimize the reward, and the idea is that by optimizing the reward you exhibit some kind of language intelligence. But again, this kind of method is pretty domain-specific: for each new domain you need to train a new agent; it requires you to have a scalar reward signal for the task at hand, which many tasks don't have; and lastly, it takes extensive training, which is a feature of RL. So think about the promise of LLMs to revolutionize text agents: these LLMs are just trained on next-token prediction over massive text corpora, yet at test time they can be prompted to solve various new tasks. This kind of generality and few-shot learning capability would be really exciting for building agents.
Next I want to give a brief overview of LLM agents; it's also a historical view, and it's obviously very simplified. I think what happened is: first we have LLMs around 2020 — I think the beginning is GPT-3 — and then people start to explore them across different tasks. Some tasks happen to be reasoning tasks, such as symbolic reasoning, question answering, and so on, and some tasks happen to be what I call acting tasks — think of games or robotics and so forth. Then we find that this paradigm of reasoning and this paradigm of acting start to converge, and we start to build what I call reasoning agents, which are actually quite different from all the previous agents. From reasoning agents, we start to explore, on one hand, more interesting applications and tasks and domains, such as web interaction, software engineering, or even scientific discovery, and on the other hand, new methods, such as memory, learning, planning, multi-agent, and so on. So first I want to introduce what I mean by the paradigm of reasoning, what I mean by the paradigm of acting, how they converge, and what this paradigm of reasoning agents is.
The history is always messy, so for now let's focus on one task, question answering, which simplifies the history discussion a little bit; then we'll come to more tasks. Question answering is a very intuitive task: if you ask a language model "what is 1 plus 2", it will tell you 3. That's question answering. It also happens to be one of the most useful tasks in NLP, so obviously people tried to use language models for question answering, and then found a lot of problems. If you have a question like this one, it's very hard for a Transformer language model to just output the answer directly; it turns out you need some reasoning, and — as covered in the last talk, chain-of-thought reasoning and so on — a lot of people have investigated how to do better reasoning with language models. You can also imagine a language model trying to answer something like this, and it will probably get the answer wrong, because, for example, if the language model was trained before 2024 — and the prime minister of the UK changes often, as you know — it might give a stale answer. In that case you need new knowledge, and people have worked on that. For another example, you can ask something really mathematical and really hard, and in that case you cannot expect the Transformer to just get the answer right; in some sense you need some way of doing computation beyond naive autoregressive decoding of a Transformer. As you can see, there are many types of question answering tasks, and people found many problems when using language models to answer those questions, and then came up with various solutions. For example, to solve the problem of computation, you can first use the language model to generate a program, and then this program runs and gives you a result; that's how you can answer questions about prime factorization, or whether a number is the 50th Fibonacci number.
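Here is a minimal sketch of that pattern, in the spirit of program-aided approaches like PAL: the model emits code instead of a final answer, and we execute the code. The `llm` function below is a hypothetical stand-in for whatever model API you use.

```python
import io
import contextlib

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a language model API call."""
    raise NotImplementedError

def answer_with_code(question: str) -> str:
    # Ask the model for a program instead of a direct answer.
    program = llm(f"Write Python code that prints the answer to: {question}")
    # Run the generated program and capture what it prints.
    # (In practice you would sandbox this; exec-ing model output is unsafe.)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})
    return buffer.getvalue().strip()

# e.g. answer_with_code("What is the 50th Fibonacci number?")
```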
For the problem of knowledge, there's this paradigm of retrieval-augmented generation, and the idea is very simple. You assume you have some external corpus — for example, Wikipedia, or the internal corpus of a company — and then you have a retriever, whether it's BM25 or DPR or so on. You can think of the retriever as a kind of search engine: given a question, the retriever pulls the relevant information from the corpus and appends it to the context of the language model, so that it's much easier for the language model to answer the question. So this is a very good paradigm.
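A minimal sketch of that loop, with a toy word-overlap scorer standing in for a real retriever like BM25 or DPR; `llm` is again the hypothetical model call from the earlier sketch.

```python
def retrieve(question: str, corpus: list[str], k: int = 3) -> list[str]:
    # Toy lexical retriever: rank documents by word overlap with the question.
    # A real system would use BM25, DPR, or a search engine here.
    words = set(question.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(words & set(doc.lower().split())))
    return ranked[:k]

def rag_answer(question: str, corpus: list[str]) -> str:
    # Append the retrieved passages to the model's context, then answer.
    passages = "\n".join(retrieve(question, corpus))
    return llm(f"Context:\n{passages}\n\nQuestion: {question}\nAnswer:")
```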
However, what if there's no corpus for the knowledge or information you care about? For example, if I care about today's weather in San Francisco, it's very hard to expect any existing corpus to have that. So people also came up with this solution called tool use, and the idea is that you keep the natural form of generation, which is generating a sentence, but you introduce some special tokens so the model can invoke tool calls: for example, a special token for a calculator, a special token for Wikipedia search, or a special token for calling a weather API. This is very powerful: you can augment language models with a lot of different knowledge, information, and even computation. But if you look at this, it's not a very natural format of text; there's no blog post or Wikipedia passage on the internet that looks like this. So if you want a language model to generate something like this, you have to fine-tune it into this very specific format, and it turns out to be very hard to call tools more than once across the text.
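As a sketch of what "special tokens for tool calls" can look like — loosely in the spirit of Toolformer; the bracket syntax here is made up — the model's output contains inline call markers, and a post-processor runs the tools and splices the results back in:

```python
import re

def fill_tool_calls(text: str, tools: dict) -> str:
    # Find spans like [CALC(299 * 421)] in the model output, run the named
    # tool on the argument, and replace the span with the tool's result.
    def run(match: re.Match) -> str:
        name, arg = match.group(1), match.group(2)
        return str(tools[name](arg))
    return re.sub(r"\[([A-Z]+)\((.*?)\)\]", run, text)

tools = {"CALC": lambda expr: eval(expr)}  # toy calculator; don't eval untrusted input
print(fill_tool_calls("299 times 421 is [CALC(299 * 421)].", tools))
# -> "299 times 421 is 125879."
```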
Then another natural question comes: what if you need both reasoning and knowledge? People actually came up with a bunch of solutions for different tasks: for example, interleaving chain-of-thought and retrieval, or generating follow-up questions, and so on and so forth. Without getting into the details of all the methods, I just want to point out that the situation at the time was a little scattered. You have this single task called QA, but it turns out to be more than a single task: you actually have tons of different benchmarks, they happen to challenge language models in very different ways, and people came up with solutions for each benchmark. It felt very fragmented, at least to me, and at least for me at the time, the question was: can we have a very simple and unifying solution? I think if we want that, we really need abstraction beyond individual tasks or methods; we need a higher-level abstraction over what's happening. The abstraction I found, at least for myself, is the abstraction of reasoning and acting. So what is reasoning? I hope you already know that from Denny's talk last time: chain of thought. It's very intuitive, and it's a very flexible and general way to augment test-time compute — to think for longer during inference to solve more complex questions. However, if you only do chain of thought, you don't have any external knowledge or tools: even the biggest, smartest model in the world does not know the weather in San Francisco today, so if you want to know that, you need an external environment for knowledge and tools. And what I have described — retrieval, code, tool use, and so on — is in some sense just a paradigm of acting, because you're assuming you have an agent and various environments, whether it's a retrieval corpus, a search engine, a calculator API, or Python. The benefit of interacting with an external environment is that it's a flexible and general way to augment knowledge, computation, feedback, and so on; however, it doesn't have reasoning, and we will see later why that's troublesome. So the idea of this work called ReAct is actually very simple. You have these two paradigms, reasoning and acting, and before ReAct, language models were either generating reasoning or generating actions. For ReAct, the idea is to just generate both, and we will see that it's actually a great way to synergize the two, in the sense that reasoning can help acting and acting can help reasoning. It's quite simple and intuitive — you can argue that's how I solve tasks or how you solve tasks, it's a very human way of solving tasks — and it's very general across domains.
The idea of ReAct is very simple. Suppose you want to solve a task. What you do is write a prompt, and the prompt consists of a trajectory that looks like this: you give an example task, and as a human, you just write down how you think and what you do to solve the task, along with the observations along the way. So if you're trying to answer a question using Google search, you think about some stuff, do some searches, and you write all of that down, along with the results Google returns, and you keep doing that until you solve the task. You give this one example, and then you give a new task; given this prompt, the language model will generate a thought and an action, the action is parsed and fed into the external environment, the environment returns some observation, the thought, action, and observation are appended to the context of the language model, and the language model generates a new thought and a new action, and so on and so forth. Obviously, you can do that with a single example, which is called one-shot prompting; you can do it with a few examples, which is called few-shot prompting; and if you have many, many examples, you can also fine-tune a language model to do it. So it's really a way of using language models, rather than a specific prompting or fine-tuning method.
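In code, that loop is just a few lines. This is my minimal sketch, not the paper's implementation: `llm` is a hypothetical model call, `parse` is a hypothetical function that splits the model output into a thought and an action, and `ONE_SHOT_EXAMPLE` stands for the example trajectory described above.

```python
def react(task: str, tools: dict, max_steps: int = 10) -> str:
    # The prompt starts with one example trajectory, then the new task.
    context = ONE_SHOT_EXAMPLE + f"\nTask: {task}\n"
    for _ in range(max_steps):
        # The model generates a thought and an action together.
        thought, action, arg = parse(llm(context))
        if action == "finish":
            return arg  # the model decided it has the answer
        # Execute the action in the external environment (search, calculator, ...).
        observation = tools[action](arg)
        # Append thought, action, and observation to the context, then repeat.
        context += (f"Thought: {thought}\n"
                    f"Action: {action}[{arg}]\n"
                    f"Observation: {observation}\n")
    return "no answer within budget"
```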
As a concrete example, let's say you want to answer the question: if I have $7 trillion, can I buy Apple, Nvidia, and Microsoft? I made this slide back in March, and that was a trendy topic at the time. You can write a prompt like this: you just say, okay, language model, now you're an agent, and you can do two types of actions — you can either Google, or you can finish with the answer — and you need to write down thoughts and actions. That's very intuitive. Now let's see what the language model does; this is what GPT-4 did back in March. It first generates a thought: first I need to find the market caps of those companies and add them together, so that I can determine if $7 trillion can buy all three. Then this triggers an action to search on Google, and the search returns a snippet as the result; fortunately, the snippet contains all the market caps you need. So ReAct reads that: now I have all the market caps, all I need to do is add them together. It uses the search engine as a calculator to add them up, gets the result, and thinks: okay, $7 trillion is not enough, you need additional money to buy all three. I think if you asked today, it would be even more money, because Nvidia is much higher now. So that's how ReAct solves the task, and you can see it's a very intuitive way, very similar to how humans solve tasks: you think about the situation, you do something to get more knowledge or information, and then based on that information you think more. Then I tried to be a little adversarial: instead of returning all the market caps, I injected an adversarial observation — nothing is found. And here comes the power of reasoning: reasoning finds a way to adjust the plan and guide the action to adapt to the situation. Because the search result is not found, maybe I can search for the individual market caps instead, so it searches for the market cap of Apple. Then I tried to be adversarial again: I gave it the stock price instead of the market cap, and here reasoning helps again. Based on common sense, it figures out that this is probably the price, not the market cap, and if you cannot find the market cap, what you can do is find the number of shares, then multiply the number of shares by the stock price to get the market cap. Then you do that for all three companies, and you solve the task. From this example you can see that it's not just acting helping reasoning — obviously acting is helping reasoning here, getting real-time information and doing calculation — but also reasoning is constantly guiding the acting, planning and re-planning based on the situation and on exceptions.
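For concreteness, here is the arithmetic the agent is doing, with illustrative, approximate figures (roughly the scale of early 2024, not exact quotes), including the shares-times-price fallback it reasoned its way to:

```python
# Illustrative, approximate market caps (early-2024 scale, not exact quotes).
market_caps = {"Apple": 2.6e12, "Microsoft": 3.1e12, "Nvidia": 2.2e12}

total, budget = sum(market_caps.values()), 7e12
print(f"total ${total / 1e12:.1f}T vs budget ${budget / 1e12:.0f}T")
# -> total $7.9T vs budget $7T: not enough.

# Fallback when only the stock price is visible: shares x price = market cap.
shares, price = 15.6e9, 170.0  # hypothetical numbers for one company
print(f"implied market cap: ${shares * price / 1e12:.2f}T")  # -> $2.65T
```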
You can imagine using something like this to solve various question answering tasks: all you need to do is provide different examples and different tools. Okay, this is good, we're making progress, but I think what's really cool is that this paradigm goes beyond QA. If you think about it, you can literally use it to solve any task, and to realize this, all you need to realize is that many tasks can be turned into text games. Imagine you have a video game: you can assume you have a video captioning or image captioning model, and some controller that turns a language action into a keyboard action; then you can literally turn many tasks into text games, and you can literally use ReAct to solve them. So it goes well beyond question answering.
After the invention of LLMs, another part of the history is that people from reinforcement learning, robotics, video games, and so forth tried to apply this technique; there are many works, and I'm only listing one as an example. The idea is very intuitive, like I said: you turn all the observations into text observations, you use the language model to generate text actions, you turn the text action back into the original format of action, and then you solve the task. But what's the issue with this? This is an example from a video game where you're trying to do household tasks in a kitchen, and the problem is that sometimes it's really hard to directly map observations to actions: for one, you may have never seen the domain, and second, to get from the observation to the action you need to think. If you don't have this thinking paradigm, all you're doing is trying to imitate the observation-to-action mapping from the prompt or from the few-shot examples. So in this case, in sink basin one there is no pepper shaker, so nothing happens, but because the agent doesn't have the capacity to think, it just keeps doing the same thing and keeps failing — it's a language model, it's just trying to imitate; it's not really trained to solve the task like an agent. What we did is actually something very simple: you literally just add another type of action, called thinking. Thinking is a very interesting action because you can think about anything. In this video game you might only be able to go somewhere or pick something up — that's the action space defined by the environment — but you can think about anything, and you can see that this thinking action is very useful: it helps you plan, it helps you keep track of the situation, and it helps you re-plan if something goes wrong. So as you can see, ReAct is a general paradigm that helps across various tasks, and it's systematically better than only doing reasoning or only doing acting.
This is interesting, and I just want to point out why, from a more theoretical perspective — again, abstraction. If you think about all the agents we've had, from video games to AlphaGo to autonomous cars, one common feature is that the action space is defined by the environment. Say you're playing a video game, an Atari game: your action space is left, right, up, down. You can be very good or very bad, but your action space is fixed. What's really different for language agents — or LLM agents, or reasoning agents — is that you have this augmented action called reasoning, and what's really interesting about this augmented action is that it can be any language. You can think about anything; it's an infinite space. You can think a paragraph, a sentence, a word, or 10 million tokens, and it doesn't do anything to the world: no matter what you think, it doesn't change the Earth or the video game you're playing. All it does is change your own context — it changes your memory — and based on that, it changes your follow-up actions. That's why I think this new paradigm of reasoning agents is different: reasoning is an internal action for agents, and it has a very special property, because it's an infinite space of language. Cool, so we've covered the most important part of the talk. The history goes on: from now on we have the paradigm of reasoning agents, and then we get more methods and more tasks.
There's a lot of progress, obviously, and I cannot cover everything, so on the methodological side I just want to cover one thing today, which is long-term memory. We just talked about what a reasoning agent is: you have an external environment, be it a video game, the Google search engine, or your car, and the difference for a reasoning agent is that the agent can also think. Another way to think about this is that you have an agent with a short-term memory, which is the context window of the language model, and it's interesting that you can append thoughts, actions, and observations to this context. But look at this context window of the language model. First, it's append-only: you can only append new tokens to the context. Second, you have limited context: it could be a thousand tokens two years ago, a million tokens now, maybe 10 million tokens next year, but the size is limited. And even with a 10-million-token window, you might have limited attention: there can be a lot of distracting things if you're doing a long-horizon task. Lastly, it is a short-term memory because it does not persist over time or across new tasks. Imagine this agent solved the Riemann hypothesis today, which would be really good — but unfortunately, if you don't fine-tune the language model, nothing changes, so next time it has to solve it from scratch again, and there's no guarantee it will solve it tomorrow. An analogy I want to make is that it's kind of like a goldfish: folk wisdom says a goldfish only has three seconds of memory. You can solve something remarkable, but if you cannot remember it, you have to solve it again, and that's really a shame. So I hope that's motivating enough to introduce the concept of long-term memory. It's just like for humans: you cannot remember every detail of every day, but you may write a diary — that's kind of a long-term memory. You read and write important stuff for your future life — important experiences, important knowledge, important skills — and hopefully that persists across new experiences. You can also imagine a mathematician writing a paper on how to prove the Riemann hypothesis; that's kind of a long-term memory, because then you can just read the paper and prove it, you don't have to solve it again.
Let's look at a very, very simple form of long-term memory, in this work called Reflexion, which is a very simple follow-up to ReAct. Let's say you're trying to solve a coding task: this is the task, and you can write a program, run it, reason, do whatever — but at the end of the day, you test it, and let's say it doesn't work: some tests failed. If you don't have a long-term memory, you just have to try again. What's different now is that with a long-term memory, you can reflect on your experience: if you wrote a program and it failed some tests, you can think about it — oh, I failed this task because I forgot about this corner case, so if I write this program again, I should remember this. And you can persist this piece of information over time: when you write the program again, you can literally read this long-term memory and try to do better next time, and hopefully it improves. This turns out to work really well for various tasks, but particularly for coding, because for coding you have great feedback, which is the unit test result: you can keep reflecting on your failures or successes, keep the experience as a sort of long-term memory, and get better.
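A minimal sketch of that loop — not the paper's implementation; `llm` is a hypothetical model call and `run_tests` a hypothetical function that returns unit test feedback as text:

```python
def reflexion(task: str, run_tests, max_attempts: int = 5) -> str:
    memory: list[str] = []  # long-term memory: lessons persisted across attempts
    program = ""
    for _ in range(max_attempts):
        # Condition each new attempt on reflections from earlier failures.
        program = llm(f"Task: {task}\nLessons from past attempts:\n" + "\n".join(memory))
        feedback = run_tests(program)  # verbal feedback, e.g. unit test output
        if "failed" not in feedback:   # toy success check
            return program
        # Reflect on the failure in language and store it for the next attempt.
        memory.append(llm(f"The program failed with:\n{feedback}\n"
                          "In one sentence, what should be done differently?"))
    return program
```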
Another way to think about this is that it's really a new way of doing learning. Think about the traditional form of reinforcement learning: you do something, you get a scalar reward, and you essentially backpropagate the reward to update the weights of your policy; there are many, many algorithms for that. Reflexion is a really different way of doing learning, because first, you're not using a scalar reward — you can use anything: a code execution result, a compiler error, feedback from your teacher, all in text, and so on. And it's not learning by gradient descent: it's learning by updating language — by language I mean a long-term memory of task knowledge — and you can think of this language as affecting the future behavior of the policy. This is only a very simple form of long-term memory, and follow-up work did more complicated things. You will hear about Voyager later, I guess, where you have a memory of code-based skills.
The idea is, for example, you're trying to play Minecraft, and you learn how to build a sword in this kind of API code; then you can remember it, so the next time you want to kill a zombie, you can first retrieve the skill of building a sword — you don't have to figure it out from scratch. And, for example, in the work on generative agents, the idea is that you have about 20 human-like agents in a small town trying to be human: they have jobs, they have lives, they have social interactions, and so forth. You have an episodic form of long-term memory, where each agent literally keeps a log of all the events that happened, every hour — that's the most detailed diary you could possibly have — and you can imagine that later, when you want to do something, you look at the log to decide what to work on: if you dropped your kid off at some place, you want to retrieve that piece of information so you can go pick them up. You can also have a sort of semantic memory, where you look at your diary and draw conclusions about other people and yourself: you can reflect on it and say, okay, Jim is actually a very curious guy, and I actually like video games, and this kind of knowledge can affect your behavior later. Yeah, so this is long-term memory.
I think the final step to finish this part is to realize that you can also think of the language model itself as a form of long-term memory. You can learn — by learning I mean improving yourself, changing yourself — either by changing the parameters of the neural network, which is fine-tuning your language model, or by writing some piece of code or language or whatever into your long-term memory and retrieving it later. Those are just two ways of learning, but if you think of both the neural network and whatever text corpus as forms of long-term memory, then you have a unified abstraction of learning. Then you have an agent that has this power of reasoning over a special form of short-term memory — the context window of the language model — plus various forms of long-term memory, and in fact you can show that this is almost sufficient to express any agent. I have this paper called CoALA, which I don't have time to cover today, but I encourage you to check it out. The claim is that you can literally express any agent by its memory (where the information is stored), its action space (what the agent can do), and its decision-making procedure (given the space of actions, which action to take). You can express any agent with these three parts, and this is a very clean and sufficient way of thinking about any agent.
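In code, that decomposition might look something like the following; this is my loose paraphrase of the CoALA framing, not the paper's formal definition:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    # Memory: where information is stored.
    short_term: list[str] = field(default_factory=list)      # the context window
    long_term: dict[str, str] = field(default_factory=dict)  # diary, skills, facts
    # Action space: what the agent can do — external actions on the
    # environment, plus internal actions like reasoning and retrieval.
    external_actions: dict[str, Callable[[str], str]] = field(default_factory=dict)
    internal_actions: tuple = ("reason", "retrieve", "learn")

    def decide(self) -> str:
        """Decision-making: given memory and the action space, pick an action."""
        raise NotImplementedError
```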
that you can try to retrieve uh so the first question is what what makes external environment different from internal memory right so imagine if the agent opens up Google doc and write something there is that a form of longterm memory or is that like some kind of action to change the external environment or like imagine if the agent has a Archive of internet right and it tries to retrieve some knowledge from there is that a kind of action or is that a kind of retrievable from a longterm memory I think this question is interesting because
if you think about physical agents like humans are autonomous cars right it's very easy to Define what is external and what is internal because for us like what's outside our skin is external what's inside our skin is internal right it's very easy to Define but uh I want you to think about for digital agents how can you even Define that and lastly like how do you even Define longterm memory versus shortterm memory like suppose you have a language model context of 10 million tokens Can can that still be called a l memory uh know that
those terms are defined from kind of human Psychology and Neuroscience and uh think about this two questions so okay so we have covered some brief history of elm agents uh I also want to uh talk about you know uh the history of LM agents in the broader context of Agents right we have talked about how we start from LM to derive various things and other developments of language agents uh but if you look at a more Asian history how is reasoning agent different from all the previous paradigms of agent so here I want to give
Here I want to give a very, very minimal history of agents, and it's definitely wrong — it's just for illustration, so don't take it too seriously. But if you had to write a minimal history of agents in one slide: at the beginning of AI, the paradigm was called symbolic AI, and you have symbolic AI agents. The idea is kind of like programming: you program all the rules to interact with all the different kinds of environments, and you have expert systems and such. Then you have the period of the AI winter, and then you have deep learning and this very powerful paradigm of RL agents — usually deep RL agents — with a lot of amazing miracles, from Atari to AlphaGo and so forth. And only very recently do we have LLM agents. This is obviously wrong, but if I have to put things in one slide, this is kind of the perspective. And remember the examples we looked at at the beginning of the talk: ELIZA is a very typical example of a symbolic AI agent, and LSTM-DQN is a very typical example of a deep RL agent in the text domain.
I think one way to see the difference between the three paradigms of agents is that the problem is the same — you have some observation from the environment and you want to take some action — and the difference is what kind of representation, what kind of language, you use to process from observation to action. With symbolic AI agents, you're essentially first mapping all the observations into some symbolic state, and then using the symbolic state to derive an action; think if-else rules — essentially you're trying to map all the possible complex observations into a set of logical expressions. With deep RL agents, a very abstract way of seeing it is that you have many different possible forms of observation — pixels, text, anything — but in some sense it doesn't matter, because it's all mapped into some kind of embedding: it's processed by a neural network into vectors or matrices, which are then used to derive actions. What's different for language agents, or reasoning agents, is that you are literally using language as the intermediate representation to process observations into actions: instead of a neural embedding or a symbolic state, you're literally thinking in language, which is kind of the human way of doing things.
And the problem with symbolic states or neural embeddings is that, if you think about it, it takes intensive effort to design those symbolic agents — think about how Waymo builds an autonomous car: you probably write millions of lines of rules and code — and deep RL agents mostly take millions of steps to train. The other problem is that both are kind of task-specific: if you write millions of lines of code for an autonomous car, you cannot reuse that for playing a video game, and similarly, if you train an agent for millions of steps to play a video game, you cannot use it to drive a car. Language is very different, because first, you don't have to do too much: you already have rich priors from LLMs — that's why you can prompt to build LLM agents, it's really convenient. And it's very general: you can think about how to drive a car, how to play a video game, or which house to buy considering mortgage rates and so on. And thinking is very different from symbolic states and deep RL vectors, because those usually have a fixed size, but you can think arbitrarily long — a paragraph, a sentence — and that brings this whole new dimension of inference-time scaling. That's why reasoning agents are fundamentally different. Okay, so I've just covered the second half of the brief history of LLM agents: we talked about long-term memory and why the methodology is fundamentally different from previous agents.
I also want to briefly talk about the new applications and tasks that LLM agents enable. As you saw at the beginning of my talk, the examples were basically question answering and playing games, and if you think about it, that's pretty much been the predominant paradigm of NLP and RL. But I think what's really cool about language agents is that they enable many more applications, in particular what I call digital automation. What I mean by digital automation is: imagine you have an assistant that can help you file reimbursement reports, write code, run experiments, find relevant papers, or review papers. If all of that could be achieved, then everybody could finish undergrad in two years, or a PhD in three years, or get tenure in three years — everything could be sped up. But if you think about it, before ChatGPT there was literally no progress: think about Siri, the state-of-the-art digital agent before ChatGPT — it could do basically nothing. Why is that? I think the reason is that you really need to reason over real-world language: if you want to write code, this paradigm of sequence-to-sequence mapping is not enough; you have to think about what you write and why you write it, and you have to make decisions over open-ended actions over long horizons. Unfortunately, if you look at the agent benchmarks that existed before LLM agents, they often look something like this: usually very synthetic tasks, very small scale, and not practical at all. That has been limiting for the history of agents, because even with the best agent in the world, if you don't have good tasks, how can you even show progress? Let's say we solve this grid world game with 100% accuracy — then what does it mean?
So I think the history of LLM agents is, on one side, the methods getting better and better, but an equally if not more important side of the history is that we're getting more practical and more scalable tasks. To give a flavor: this task is called WebShop, which I created with my co-authors around 2021-2022, and the idea is that you can imagine LLM agents helping you do online shopping. You give the agent an instruction to find a particular type of product, and it can browse the web like a human: it can click links, type search queries, check different products, go back and search again, and if it has to explore different items, it can think about how to reformulate the query. You can immediately notice that this environment is much more practical and much more open-ended than a grid world. Let's say you find a good product: you can click all the customization options and click "buy now", and you get a reward from 0 to 1 indicating how well your purchase matches the instruction. So it's really a very standard reinforcement learning environment, except that the observations and actions are in text, and it turns out to be a very practical environment. WebShop is interesting because it was the first time people built a large-scale, complex environment based on large-scale real internet data: at the time we scraped more than a million Amazon products, built the website, and built an automatic reward system that, given the product you found and the instruction, gives a reward indicating how well the two match. And you can clearly see it's perhaps harder than grid world tasks, because you need to understand not only images and language in real-world domains, but you also need to make decisions over long horizons: you might have to explore ten different products, or different search queries, to find the perfect match.
On this direction of web interaction, follow-up work has made great progress: beyond shopping, you can actually solve various tasks on the web. You can also try to solve other practical tasks — for example, software engineering. SWE-bench is a task where you are given a GitHub repo and an issue: you're given a bunch of files in a repository, and an issue saying "this thing doesn't work, help me fix it", and you're supposed to output a patch that resolves the issue. It's a very clean definition of a task, but it's very hard to solve, because to solve it you have to navigate a complex codebase, create unit tests, run them, and try various things, just like a software engineer.
Another example that I think is really cool — I think the current progress goes well beyond digital automation — is this work on chemistry that I really like, where they use reasoning agents to try to discover new chemicals. What's really cool is that you give the agent a bunch of data about some chemicals, and you give it access to tools like Python or the internet, and it can do some analysis and try to propose new chemicals. And the action space of the agent is somehow extended into the physical world, because the action — the suggestion from the agent — is then synthesized in a wet lab, and you can imagine getting feedback from the wet lab and using that to improve the agent, and so on. So I think it's really exciting that you can think of language agents not only as operating in the digital domain but also in the physical domain, and not only solving tedious tasks like ordering DoorDash, but also more intelligent or creative tasks like software engineering or scientific discovery. Okay, great, so we have finally covered this slide.
In summary, I have talked about how we started from LLMs, how we got the paradigm of reasoning and the paradigm of acting, how they converged, and how that brought up more diverse tasks and methods. We have also covered, on a broader timescale, the paradigms of agents and why this time is different — and also from a task perspective: the previous paradigm of tasks in AI was games, simulations, robotics, but LLM agents bring up this new dimension of tasks, which is automating various things in the digital world. So we have covered a lot of history, and I just want to summarize a little bit in terms of lessons for doing research. Personally, as you can see, it turns out some of the most important work is sometimes the most simple work: you can argue chain of thought is incredibly simple, and ReAct is incredibly simple. And simple is good, because simple means general: if you have something extremely simple, you probably have something extremely general, and that's probably the best research. But it's hard to be simple. If you want to be simple and general, you need the ability to think in abstraction — you have to jump out of individual tasks or data points and think at a higher level — but you also need to be very familiar with the individual tasks, the data, the problem you're trying to solve. Notice that it can actually be distracting to be too familiar with all the task-specific methods: remember, in the history of QA, I covered a lot of task-specific methods, and if you're very familiar with them, you might end up trying to create an incremental solution on top of them. But if you're familiar not only with QA but with a lot of different tasks, and you can think in abstraction, then you can propose something simpler and more general. In this case, I think learning the history really helps, and learning other subjects helps, because they provide priors for how to build abstractions and ways to think in abstraction. Okay, so that's mostly the talk. I will just briefly share some thoughts on the future of LLM agents: everything before this slide is history, and everything after this slide is kind of the state of the art, or the future. Obviously the future is very multi-dimensional, and there are many directions that are exciting to work on.
I want to talk about five keywords that I think are truly exciting topics: first, they are very new, in the sense that if you start working on them now, there might be a lot of low-hanging fruit, and you might have a chance to create some very fundamental results; and second, they are somehow doable in an academic setup — you don't have to be OpenAI to do this, though it's still good to be open. These five topics correspond to three recent works that I did; I'll only cover them briefly, and if you're interested you should check out the papers yourself. The topics are: first, training — how can we train models for agents, and where can we get the data? Second, interface — how can we build environments for agents? Third, robustness — how can we make sure things actually work in real life? Fourth, human — how can we make sure things actually work in real life with humans? And lastly, benchmark — how can we build good benchmarks? So first, training.
I think it's interesting to note that up until this year, language models and agents have been kind of disentangled, in the sense that the people training models and the people building agents are kind of different people. The paradigm is that the model-building people build some model, and then the agent-building people build agents on top of it using some fine-tuning or prompting — mostly prompting. However, these models are not trained for agents: if you think about the historical root of language models, it's just a model trained to imitate text — people could never have imagined it would one day be used for chemical discovery or software engineering. That brings the issue of data discrepancy: the model is not trained to do those things, but it's prompted to do them, so the performance is not optimal. One way to fix this is to train models targeted for agents, and one thing you can do is use those prompted agents to generate a lot of data, and then use that data to fine-tune the model to be a better agent. This is really good, because first, you can improve agent capabilities not covered on the internet. The internet, which is the predominant source of language model training data, doesn't contain much, say, self-evaluation data: people only give you a well-written blog post, but no one releases the thought process and action process behind writing it — yet that's exactly what matters for agent training. So you can prompt agents to produce those trajectories and train models on them, and I think that's really one way to fix the data problem, because we all know internet data is running out, and where do we get the next trillion tokens to train models? This is very exciting, and a maybe-not-best analogy is the synergy between GPUs and deep learning: GPUs were not originally designed for deep learning — they were designed for games — and then people explored their usage and found they were very good for deep learning. What happened is that not only did people use existing GPUs to build better deep learning algorithms, but the GPU builders also built better GPUs to fit the deep learning algorithms — you can build a GPU specific to Transformers, and so on. I think we should establish the same synergy between models and agents. The second topic is interface, and in fact human-computer interface has been a subject for decades — a great topic in computer science.
Really, the idea is that if you cannot optimize the agent, you can optimize the environment. If you're trying to write code, even as the same person, it makes a difference whether you're doing it in a plain text editor or in VS Code: you're still the same, you're not smarter, but with a better environment you can solve the task better. I think the same thing happens for agents. As a very concrete example: how can an agent search files in an operating system? The human interface in a terminal, as we all know, is ls and cd and so on; it works for humans, but it's not the best interface for agents. You can also define a new command called search that returns one result, plus an action called next to get the next result — but that's probably still not the best for a language model. In this research, called SWE-agent, what we found is that the best way to help agents search files is a specific search command that, instead of giving one result at a time, gives ten results at a time, and lets the agent decide which file is best to look at. You can actually do experiments and show that with the same language model and the same agent prompt, the interface matters for downstream task performance.
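Here is a sketch of that contrast; the function names and the top-10 design are illustrative of the idea, not SWE-agent's actual commands:

```python
def human_style_search(query: str, paths: list[str]):
    # Humans page through matches one at a time (think: a "next" button).
    for path in paths:
        if query in open(path).read():
            yield path  # the caller asks for the next match when ready

def agent_style_search(query: str, paths: list[str], k: int = 10) -> str:
    # Agents have long context windows: return the top-k matches in one shot
    # and let the model reason about which file to open.
    hits = [p for p in paths if query in open(p).read()]
    if not hits:
        return "No matches found. Try a different query."
    return "\n".join(hits[:k])
```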
I think this is a very exciting topic, it's only getting started, and it's a great research topic for academia: you don't need a lot of GPUs to do it. It's interesting because models and humans are different, so their interfaces should be different: you cannot expect VS Code, which is built for humans, to be the best coding interface for language models. There must be something different, and we need to explore that. In this case, you can think of the difference as humans having a smaller short-term memory: if I give you ten results at the same time, you can't just read them all, which is why human interfaces are designed to be interactive — you have a "next" button, and if you do Ctrl-F you read one match at a time. But for models it's the reverse, because models have a longer context window: if the model does the equivalent of Ctrl-F, you should probably just give it everything. So designing better interfaces can help you solve tasks better with agents, and it can also help you understand agents better.
It can help you understand some of the fundamental differences between humans and models. Lastly, I want to point out the topic of human-in-the-loop and robustness. There is a very big discrepancy between existing benchmarking and what people really care about in the real world. Think of a very typical agent task, or AI task: coding with unit tests. This is a part from AlphaCode 2, and basically the idea is that if you sample more times, the chance that you have a right submission increases — that's very intuitive, and if you have unit tests, you can sample many, many times. What you really care about is what we call pass@k: can I solve it one time out of 10,000 times, or a thousand times, or a million times? It's kind of like solving the Riemann hypothesis: you just need to do it once — what you care about is whether, sampling 10 million times, you can solve it once. But most jobs in the real world are more about robustness. Suppose you're doing customer service: LLM agents are already deployed for customer service, but sometimes they fail, and there are consequences — if the agent does something wrong, the company might owe compensation and such. Arguably, customer service is much easier than coding or proving the Riemann hypothesis, at least for humans, but it presents a different challenge, because what you care about is not whether you can solve it one time out of a thousand; what you care about is whether you can solve it a thousand times out of a thousand — or: will I fail one time out of a thousand? Because if you fail once, you might lose a customer. It's more about getting simple things done reliably. I think that really calls for a different way of doing benchmarking, and we have this recent work called τ-bench, where the idea is, first, you have a very practical task, which is customer service.
Second, the agent is not only interacting with some digital environment; it's also interacting with a human — but a simulated human. The idea is that the customer service agent, just like a human customer service agent, needs to interact with both the backend API of the company and some kind of user, and the agent really needs to interact with both to solve the task. The trajectory might look something like this: the human might not give you all the information at the beginning — which is the predominant paradigm of all the tasks right now, if you think about software engineering and so forth. Imagine the user says something like "change flight": you might need to actually prompt the user — can you tell me which flight you're changing? — and interact with them over multiple turns to figure out what they need and help them. This is very different, and it also changes the metric you care about.
For the same task, you can run the agent multiple times with the same user simulation. The dashed line is pass@k, which measures: if you sample k times, can you solve it at least once? Obviously, as you sample more, the chance of solving it at least once increases. But here you don't care about whether you can solve it one time out of ten; you care about whether you can solve it ten times out of ten, because otherwise you might lose a customer. So the solid line measures: as you sample more, what's the chance you always solve the task, across all the possible tasks? What we see in today's language models is that they obviously have different starting points — different capabilities — but what's really concerning is that they all show this decreasing trend: as you sample more, the robustness always goes down, from small models to big models. The ideal trend should be something more flat.
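The two metrics are easy to contrast under a simplifying independence assumption — each trial succeeds with probability p; τ-bench's actual pass^k is estimated from sampled trials, so treat this as the idealized version:

```python
def pass_at_k(p: float, k: int) -> float:
    # Chance of succeeding at least once in k i.i.d. trials: rises with k.
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    # Chance of succeeding on all k trials: falls with k.
    return p ** k

p = 0.9  # a seemingly strong per-trial success rate
for k in (1, 2, 4, 8):
    print(f"k={k}: pass@k={pass_at_k(p, k):.3f}, pass^k={pass_hat_k(p, k):.3f}")
# At k=8: pass@8 is nearly 1.000 but pass^8 is about 0.430 — the decreasing trend.
```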
If you can solve something, you should be able to solve the same thing reliably over time. So I just want to point out that I think we need more effort in bringing real-world elements into benchmarking, and that requires new settings and metrics. We have a blog post with some thoughts on the future of language agents, and one way to think about it is to ask what kinds of jobs they could take over. Maybe the first type of task is not that intelligent but really requires robustness: think simple debugging, customer service, or simple assistance, time and time again. Second, you need to collaborate with humans. And third, you might need to do very hard tasks — write a survey from scratch, or discover a new chemical compound — and that requires new ways for agents to explore on their own. I think it's in general very useful to think about what jobs they could take over, why they're not replacing those human jobs yet, what's missing, and how we can improve. Lastly, this talk is limited in time, but we're going to have an EMNLP tutorial on language agents in November, and it will be three hours, so hopefully it will be more comprehensive than this.