This is work with a very large number of people, over 200 contributors, and I've included a lot of slides adapted from a couple of these folks, Krista and Michael. The context of this lecture is that it has never been easier, as we're all aware, to build really impressive AI demos, and one of the best examples of this is none other than ChatGPT. It's this one interface where you can ask just about any question that comes to your mind, and not only will the system find answers that suit your questions, it will synthesize them on the fly, conversationally. We all take that for granted now, but let's just marvel at how amazing that is. What makes it really cool, and a big step forward compared to before, is that it can also help us with tasks. I have an example here of me asking ChatGPT to help with a piece of code: I'm asking it to parallelize a sequential Python loop I had, and it does exactly as I asked. Now, at this stage, we're all extremely familiar with
the primary weakness of these language models, which is in some sense precisely their key strength: they are so fluent that when they make mistakes, those mistakes can be incredibly hard to detect. Stanford was not founded in 1891, although it's close; maybe using that example at Berkeley is weird, but anyway. And the whole reason I asked for help with this piece of code was that I wanted to avoid data races, so when I was given a piece of code that simply naively parallelized the loop and had a bug in it, I was glad I didn't trust that model and didn't simply plug the code into my codebase. The big picture here is that even though it's incredibly easy to build impressive demos, turning monolithic language models into reliable AI systems remains really hard, and that's what the rest of this talk will focus on. These are not just problems you see in your personal interactions with a chatbot; they have real implications all over the place, and that's just one example of this happening in practice. Now, I'm not saying ChatGPT is bad, or language models are bad, because they made a mistake or two; every AI system, and everything we'll talk about today, will always make mistakes. What I'm saying is that, fundamentally, the monolithic nature of language models makes them particularly hard to control when we are building systems, to debug when they do make mistakes, and to improve when we want to iterate on the development of our systems. Increasingly, the way people are tackling this, as we had in the title of the lecture, is by building compound AI systems. If you haven't heard this term before, it just means modular systems, modular programs, in which language models are not serving as the user-facing end-to-end system but are instead playing specialized, modular roles inside a bigger architecture. A very familiar example of that, which I'm sure prior talks in this class discussed a lot, is retrieval-augmented generation.
Instead of building a system that takes a question like this and just gives it to your monolithic language model, just a black-box deep neural network, and basically hopes the system does the right thing, you might break it down into smaller pieces: have a retrieval model consult a massive corpus with the question as a search query, retrieve the top-k most relevant results, feed those as context to the language model, and then prompt the model to use that information and cite it in its answer. There are lots of reasons that might be attractive. One is transparency: this system might still make mistakes, or it might say the right thing, but either way we can always inspect the trace of the system's behavior, see what it retrieved, and see why it generated the information it did. That helps us see when it's right, when it says something justified based on a citation that's factual, and it helps us see when it's wrong, for example when it made an incorrect inference based on a relevant piece of information, or when it simply retrieved something irrelevant and extrapolated in a weird way. There's another reason this is really attractive: a system like this can be a lot more efficient. It has more steps, for sure, but now the language model does not necessarily need to know everything about all topics, because we've offloaded that knowledge, and a lot of the step-by-step control flow, either to a knowledge base from which we're retrieving or to simply a program that is executing these steps.
So we've gained a lot from having this compositional approach. Another example of a compound AI system beyond RAG, and something slightly more sophisticated, takes that to the next step. One of the biggest powers of language models is that they are not limited to answering questions they've seen exactly before; they can synthesize or compose information, or at least that's the hope. This is exactly what a compound AI system that does multi-hop reasoning, or multi-hop retrieval-augmented generation, would do. We can now take questions and, instead of simply giving them to a dumb search engine, ask our language model itself to act as a module that breaks the question down into little pieces, finds information about each piece, and then returns that to the language model, whose job is to take whatever got retrieved and produce answers that synthesize the information in a more holistic way. What's really great about compound AI systems, continuing that line of thought, is that we have a lot of control as the people building the system. You're not bottlenecked on the next release of a language model, whether or not you're the one building it, which might be unlikely; you actually have a lot of control over the development of your architecture, and that allows you to iterate much faster and to build systems like Baleen, a system I built near the start of my PhD, that do things that individual language models at that time absolutely couldn't do.
Another example takes this even another step further and asks: could we have these systems generate long reports, maybe articles with citations, like a Wikipedia page? That's where you could say something like "write me an article on the ColBERT retrieval model," and a system like STORM, from Stanford, could have a lot of modular components at a much finer level of granularity. It could brainstorm, generate outlines, revise those outlines, then ask a model to generate questions about each part of the outline, retrieve sources, and synthesize information in a much more systematic way than a simpler architecture would. What's really great here is that because we have this ability to compose better-scoped uses of our language models, we can really iterate on the quality we get out of these systems. It's much easier for a language model to take information that's given to it and synthesize something reasonable than to tackle the bigger problem of, say, trying to remember facts. By being able to manage all of these little compositions, we can do a lot more in terms of quality. And lastly, the final advantage of building these types of systems is inference-time scaling. It's pretty clear at this point that if you intelligently spend more compute at test time, for example by having the language model search over a large space of potential paths, that can really help.
An example of a compound AI system where this really helps is AlphaCodium, a system targeted at generating code. Instead of simply asking a model to produce a piece of code, it has all these steps in which it will, for example, reflect on the problem, reflect on public tests that may exist for a given task, generate various solutions, rank them based on their performance, and then iterate from there. It's not unintuitive that this type of decomposition, which reflects the way you might instruct an intern or a friend to approach a task like this, can really help quite a bit, and that's what this type of work shows. This keeps popping up in many, many applications, in many cases also in task-agnostic form, in the form of methods that can scale compute independently of a particular task. So we've talked about how amazing compound AI systems are and how they give us these advantages: quality, control, transparency, efficiency, and inference-time scaling. They're awesome, but the problem is that, unfortunately, we're working with highly limited modules. At the end of the day, our language models themselves are extremely sensitive to how you ask them to play the roles of these little pieces you're trying to compose. Because of that, under the hood, the really beautiful diagrams we looked at are typically implemented, and we're all guilty of this, myself included, with thousands of tokens of English or some other natural language trying to coerce a language model into playing the role of each of these modules. This is something we see across all types of tasks, where we're writing tens of kilobytes of strings in JSON files trying to define our language-model systems.
Of course, if we could just do that and succeed in a way that is highly portable and highly general, maybe it's a fine price to pay. But the problem is that it's not something that will generalize or systematically allow us to compose these systems. The problem, at a high level, is that each of these prompts implementing each of the modules we looked at is coupling five different roles. The first role it's trying to cover is that of a signature: I want a function playing this module, and essentially it's just a specification of "here are your inputs and here's what they mean, and here's the transformation I want you to make to give me the outputs." Second, it's specifying the computation that tries to specialize this signature with some kind of inference-time strategy: you might tell the model "please think step by step," or you might tell the model "you're an agent, you have these tools, you should call these tools, and when you call them I'll intercept the call and give you back the response," or you might say something like "I'll generate 10 different things and explore all of them with some kind of reward model." In these cases we're expressing a lot of that in the prompt and in the code around the prompt. Third, the prompt is also coupling, with all of that, the computation that formats all the inputs we want to give to this function, this signature. Fourth, it's encapsulating the logic for how we want to get several outputs back, parse them, cast them to the right types, and maybe retry if they're not formatted properly, and all that stuff. On top of that, it's expressing the objective: you're not simply describing what the input and output behavior is, you're really also trying to encode in your prompt a lot of information about dos and don'ts and what you're trying to maximize: be factual, don't hallucinate, don't cite pages that don't exist, or whatever. That's essentially a different notion from simply declaring what the module is doing, as we see in the diagrams. And lastly, for any given language model, as we said, they are super sensitive and these pipelines are really complex, so you're going to see in practice that people work really hard to coax the model into doing the right thing through a lot of trial and error in English or some other
natural language, and maybe there's also some fine-tuning of the weights around that if you're doing a more advanced application. The bigger picture here, across these five roles, is that existing compound AI systems are awesome; they have a lot of potential to be really modular and to solve problems we can't solve with neural networks alone or with language models alone. But the problem is they're too stringly typed, if you will. They couple the fundamental architecture of the system design that the developer wants to express, which, nicely, is the thing we see in the diagrams of compound AI systems, with incidental choices that are super specific to the particular pipeline you have or the language model choice you've made, which are things that change all the time. So you have a pipeline that has five pieces; you go in and you want to slightly change the objective, or introduce a sixth module, or maybe swap the language model for a cheaper one that just came out and is supposed to be just as good. Good luck: your entire sequence of thousands of words of prompts is basically irrelevant at that point. And this is something that really blocks adoption in practice, as well as portability of a lot of the cool systems we covered, where people in industry, for example, could approach you and say: "Well, we really love that paper, but it's clear that the prompts there were tuned and developed with a particular dataset or benchmark in mind. That's not our use case. We have no visibility or clue how you decided to arrive at those prompts, so we're
basically, practically, unable to use the stuff you built." So an argument I'll be making today, and relying on in the rest of this lecture, is that we do know how to iteratively and controllably build systems and improve them in a modular way, and that elusive concept is called programming. A lot of the rest of this talk will be about this ambition: what if we could build compound AI systems as programs, just computer code, standard Python, but with fuzzy natural-language functions, or natural-language modules, that can learn their behavior from data? You'd basically be writing code and then inserting these little pieces that represent the boxes we saw earlier, the parts exhibiting intelligent behavior; then you specify an objective, and you have the system try to learn the behavior around the natural-language specifications you're putting on these modules. If we could do that, we could lose a lot of the problems we saw with existing compound AI systems: you're not going to have to write these prompts; you're going to build a general design that can be ported from one language model to another. In this analogy, the portability is like working with hardware: if you write a piece of code in C on a certain CPU architecture and you move to a different CPU architecture, a good compiler is going to take your same high-level C code and compile it to potentially very different lower-level machine code on the different architectures, making sure it still works. So the ambition we have here is to write code in higher-level abstractions that are as independent as possible of the low-level details of how to get a language model to do the thing you want it to do, and then to build compilers, or optimizers, that take that and spit out specific strategies for working on a given infrastructure, which in this case means a language model or other
things. These are some of the systems we saw earlier, and the approach we will take toward this vision is called DSPy. Someone, I think Alex, said it stands for "data science Python"; it has nothing to do with data science, it's a long story, but DSPy is Declarative Self-improving Python. So it's Python, but smarter. The way we're going to do this is to define this thing called a language model program, which is just a Python program, just a Python class or function. It's a function that takes inputs in natural language, maybe a question we want to answer, or a report we want to summarize, or a topic we want to generate, I don't know, an email about; and the output is also natural language, maybe the answer or the report or the email, whatever it is, and maybe the output is several pieces of these. What makes this interesting as a function is that in the normal course of its execution it's just loops and exceptions and goto statements and all the things you shouldn't use, but along the way it's calling these functions that are modules, and they're fuzzy. It's making calls like "hey, generate a search query for me" or "turn this natural language into SQL," whatever it is. It's making these high-level natural-language calls, which look like prompts but are much more structured and much shorter, because they only express the stuff you need to declare: what the actual behavior is, not how the model should do it. So we have this function that, in the course of its execution, calls these modules, and each module is defined by simply a declaration of what it should do, what inputs it takes, and what outputs it produces; usually that's a one-liner, in many cases.
Here's an example of an actual simple DSPy run I put together before the talk. You can say, "I want a fact-checking module." Before that, there might be loops and exceptions and commented-out code and all sorts of mess, and then you say: I want to apply a chain-of-thought strategy over this function signature, one that says "take a set of claims and give me a list of booleans that are the verdicts of whether these claims are true or not." Then I get a module. As a function, I can give it a list of claims, and it can think and then tell me: well, the first one is true and the second one is false. So far, this is basically just the interface of a function declared with natural-language types. The keywords here are not special, but as a language model I can understand them because I understand English, and I can see that, okay, "verdicts" suggests that, probably, even before I optimize this or learn anything, the first claim is a true statement, while Python is not a compiled language in standard usage, so the second is false.
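To make that concrete, here's roughly what such a module looks like in code. This is a hedged sketch, not the exact demo from the talk: the model name is illustrative, and the typed signature syntax assumes a recent DSPy release.

```python
import dspy

# Configure some language model; the model name here is illustrative.
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A chain-of-thought strategy applied over a one-line signature:
# take a list of claims, return a list of boolean verdicts.
fact_check = dspy.ChainOfThought("claims: list[str] -> verdicts: list[bool]")

result = fact_check(claims=[
    "Stanford University was founded in 1885.",
    "Python is a compiled language in standard usage.",
])
print(result.reasoning)  # the model's intermediate "thinking"
print(result.verdicts)   # e.g., [True, False]
```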
So this basically defines an optimization problem for us under the hood. You're given the set of modules that your program calls, and for every module our goal is to decide how to take this call and actually call a language model, which means coming up with the string that goes into the model, which might be a lot more sophisticated than, or just very different from, how it looked at the specification level we saw earlier in the signature. The other thing we can also control in some cases, although we won't cover a lot of that today, is the weight settings we want to assign our language model, in the sense of fine-tuning the model to perform better at our task. So the idea is: now we have this function with all these modules, and if you can give me a training set of inputs to that function, and maybe some kind of hint or final label or some kind of metadata, I should be able to find the settings of the prompts and the weights such that, on average, or in expectation, some metric of calling the program with the modules assigned in this way is maximized.
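Written out, the objective is roughly the following (the notation here is mine, loosely following the DSPy papers): for a program Φ with per-module prompt parameters Π and model weights Θ, a training set X of inputs x with optional labels or metadata m, and a metric μ, we want

```latex
\Pi^{*}, \Theta^{*} \;=\; \arg\max_{\Pi,\,\Theta}\;
  \frac{1}{|X|} \sum_{(x,\, m) \in X} \mu\!\left( \Phi_{\Pi,\Theta}(x),\; m \right)
```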
That's the optimization problem we'll be dealing with once someone writes a program in an abstraction that can support this type of behavior, which is DSPy. Now, the problem is that this is really hard. We don't have, and we're not asking anywhere for, gradients across the system: the system might generate code and execute it, it might call external tools, it might run a calculator. So we don't necessarily know how to optimize a system like this, certainly not directly with gradients. And we can't cheat and try to optimize each module on its own, because we are not, in general, asking for labels for every step of the system. If we were to ask for labels for every step, the whole argument about iterative development and modularity would collapse: I added a module, and now you're asking me to go label some data. Whereas the ideal development cycle we want is that you're building a system, you notice a need to add a module of some sort, or you want to experiment, you just inject it, you recompile, and you see whether your system got better or not. So we can't assume the existence of labels for each of these modules, and that's a problem, because it's not obvious how to go about optimizing this; that's one of the things we'll talk about in the rest of these slides. Let's take a concrete example, so that the rest of this develops in a way we can follow. Suppose we wanted to build a simple multi-hop retrieval-augmented generation pipeline.
I have it in visual form here, although the DSPy code is basically the same. It's a function that takes a question, which is a string, and outputs an answer, which is also a string. In that function, the last step is that this question, with some context, gets given to a language model, which generates our answer. But the question is: how do we get this context? The multi-hop part is that we'll have a loop; maybe the loop could be smarter, it could decide when to stop or something, but for now we'll just hard-code that it goes twice. In every iteration, we ask our language model to take whatever context we've built up so far, which is empty at the beginning, and generate a search query; we take this query, dump the results into the context, maybe append them, and then loop again, so that the system is generating queries that seek things we haven't found so far. A question that would come up in this scope would be something like, "How many floors are in the castle that David Gregory inherited?", an example I like to use. The first query you might ask is, "Who is David Gregory?" Then you might learn that he's a Scottish guy from, I don't know, the 1600s or so, and that he inherited a castle called Kinnairdy Castle. Then you might ask, "How many floors are in Kinnairdy Castle?", given that the original question was seeking that, and that gives you the multi-hop behavior we're seeing here.
That visual function can be written in DSPy code like this. We're not going to look at too much code, but this is just a module; if you're familiar with something like PyTorch, or deep networks in general, we're borrowing some of the syntax. In this module, we have an initialization method that just declares the submodules in our compound AI system: a chain-of-thought strategy expressing a signature that says "you take some context, you take a question, and you generate a search query," where all of these are strings (we don't annotate the types because string is the default), and then another module that takes the same kinds of inputs and generates an answer. The part of your code that is the actual program logic is the forward method, where the loop we saw in the last slide can simply be expressed.
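As a hedged sketch, assuming the current DSPy module API, the program just described might look something like this; the retriever setup and the hyperparameters are illustrative.

```python
import dspy

class MultiHopQA(dspy.Module):
    def __init__(self, num_hops=2, k=3):
        super().__init__()
        # Declare the submodules: all signature fields are strings by default.
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
        self.retrieve = dspy.Retrieve(k=k)  # assumes a retrieval model was configured
        self.num_hops = num_hops

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):  # hard-coded to loop twice by default
            query = self.generate_query(context=context, question=question).search_query
            context += self.retrieve(query).passages  # dump the results into the context
        return self.generate_answer(context=context, question=question)
```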
We talked about signatures: they tell the system what the module should do, as opposed to working really hard with a particular language model on how it should express that behavior. And the modules, as we said, are the many strategies for spending compute with the language model; they define a general approach for taking a signature and actually expressing it in terms of calls to the language model, and they help improve quality in many settings. They're heuristics we can apply, very similar to layers in neural networks. Let me go on a tangent here; this is not essential to understanding the rest of the talk, but intuitively, things like chain of thought are normally just conceptual things that you can't actually compose, because chain of thought in the original paper, which is super cool, is basically just a bunch of prompts the authors wrote for individual tasks. For answering math questions, you write some examples and ask the model to do stuff; if you want to apply it to a different task, you've got to write different prompts. What we're saying here is: why can't we borrow from layers in neural networks, where you give a layer dimensions that describe what types of tensors it accepts and what types of tensors it gives you, and have the behavior be expressed from that? In a neural network architecture you could say, "I want attention that takes the following vectors and gives me those other vectors," or you could have an RNN do this, or a linear layer, or a convolution, or whatever. Similarly, here we're asking: why can't we take these inference strategies and make them general, in a metaprogramming sense, defining them over signatures and having them express that behavior in a general way that can be composed? So the question is: this is the abstraction, this is how you write the program, but what are we supposed to do so this actually works as a system? Because it looks pretty to me; maybe not to everybody, but to me this looks really elegant. The question is how we take this and actually give you a system that you could actually deploy, that you're happy with, that you can iterate on, and all that.
Iterating on it is easy, because you can come in and add one more module, add an if statement, throw an exception, and so on; you can do all kinds of things. But the question is really how we translate these strategies with signatures into actual prompts under the hood. You take the program, and the very first step is pretty trivial, although it's important to do it right: internally, we can translate this into a basic prompt, one that will not necessarily work really well, with no guarantees about it, through built-in adapters and predictors. Predictors are just another kind of module with some logic inside; fundamentally, what a chain-of-thought predictor does is add one more output field that asks for reasoning. The adapter then takes this and, depending on how you implement it, or which one you choose, says: "I have a model here that's maybe a chat model, or maybe an instruct model, and I want to format these fields in a certain way just to kick-start the process." So it gives us a basic prompt under the hood, something like: "Given the fields context and question, give me a query, and here is the format I want you to follow, because I want to be able to parse what you give me." Nothing about this is going to work particularly well, but it gets us started.
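For intuition, the adapter's output looks roughly like the template below. This is an approximation of classic DSPy formatting, not a verbatim dump; the exact field markers vary by adapter and version.

```
Given the fields `context` and `question`, produce the field `search_query`.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the search_query}. ...

Search Query: ${search_query}
```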
Now, the role of optimizers on top of this, which is a different component, and there are many algorithms we'll discuss, is to take this initial prompt, and the whole program in which there are many of these modules and prompts, and to look at them as parameters, as a lot of variables we can tinker with, and to figure out how to maximize the metric we saw in the objective a few slides ago. So we might start here, and maybe the system performs at 37% accuracy on a given strict metric; the actual quality might be a bit higher, since, let's say, this is a metric that's highly precise but has somewhat lower recall. We might then just say, "I like the MIPROv2 optimizer in DSPy," much the same way that in a neural-network setting you might say "I like Adam" or "I like RMSProp." You give it some data, you give it the program, you give it a metric, and you ask it to do a good job, and it spits out a better prompt, maybe fancier instructions, and some examples that it sticks into the prompts of one or several of the modules, such that the quality is substantially higher.
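In code, using an optimizer this way might look like the sketch below; the metric, the dataset, and the `auto` budget setting are illustrative, and the API assumes a recent DSPy release.

```python
import dspy

# A hypothetical strict metric: exact match on the answer string.
def exact_match(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# trainset is assumed: a list of dspy.Example(question=..., answer=...) items.
optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
optimized_program = optimizer.compile(MultiHopQA(), trainset=trainset)
```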
So on a certain real task, instead of tweaking 2,000 tokens of prompts that look something like this, and at the end of the day getting an accuracy of around 33% with a certain model from OpenAI, we can explore a much larger design space and get higher scores in interesting ways through these compositions. Let's look at a results table on the multi-hop question-answering task we've been using as a running example. The simplest thing you could do in DSPy, or when you're thinking about a compound AI system at all, is the trivial compound AI system that's not actually compound, in which you just ask the system to take the input and predict the output, and you don't bother optimizing it. We did this a while back, exactly a year ago. At that level, you can get GPT-3.5 to perform at a certain score, with a Llama 2 model a little bit behind. But you could say, well, we know a decomposition like RAG could help, especially since this is a factual question-answering task, so we could build a RAG system; and the interesting thing is we can also try running one of the optimizers we'll talk about in a minute. What you start seeing is the cool result that a small model can already begin to perform better than a larger model with a simpler architecture, or even with a sophisticated architecture without optimization, while the large model preserves a lot of its advantage here as well.
Now you can iterate on your program and build a multi-hop compound AI system, and there you can see that with optimization you can boost the quality quite a bit, with large models as well as open models; and with more recent models, all of these numbers are substantially higher nowadays. What's really cool, although we won't be able to spend a lot of time on it, is that the very same program we wrote here, which is giving us all of these results across models and across optimization decisions, can also be used with optimizers that update the weights, that essentially do fine-tuning; a whole space of reinforcement learning opens up in front of us. We can get very small models to imitate these bigger ones: we can have a model under a billion parameters that's many years old, T5 is from, I don't know, 2019, score competitively, in fact better than closed models that were near the frontier at that time, on the given task.
The question we'll be looking at over the rest of the lecture is: what does the space of optimizers look like, what are the interesting choices we can make, and what works and what doesn't? In general, there are too many DSPy optimizers to discuss today. They vary in how they tune the prompts and weights in a program, but there's actually a general pattern you can see in most of them. The first step, as we saw earlier, is that for every module you just guess an initial prompt. In DSPy this actually happens outside the optimizer, through what we call the adapter, which takes your signature and makes a mostly deterministic construction (sometimes involving a language model) of what the initial prompt should look like. The next thing we do is rejection sampling: we take the inputs you gave us in your problem specification, maybe we raise the temperature of the language model, maybe we plug in a large model, maybe we just keep the system simple, and then we run through your program with these basic prompts to collect trajectories across all of these steps that lead your metric to assign high scores. You've given us this program, a bunch of inputs, and a metric that can assess when things are good or bad; maybe it just checks whether the answer is correct, or maybe it asks a language model, or even a DSPy program, to evaluate the system against a rubric, or any number of other design choices you're allowed to experiment with. We use those to start collecting traces of every module, examples of each module's inputs and outputs that, when chained together, lead to high scores through your metric.
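A minimal sketch of that rejection-sampling loop is below. The tracing helper `run_with_trace` is hypothetical, standing in for DSPy's internal tracing; `program`, `trainset`, and `metric` are assumed, and DSPy's `BootstrapFewShot` implements the real version with retries and limits.

```python
def bootstrap_demos(program, trainset, metric):
    """Collect per-module input/output examples from trajectories the metric likes."""
    demos = []  # (module_name, inputs, outputs) triples from winning runs
    for example in trainset:
        # Hypothetical helper: run the program while recording every module call.
        prediction, trace = run_with_trace(program, example)
        if metric(example, prediction):  # rejection sampling: keep only high-scoring runs
            demos.extend(trace)
    return demos
```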
We can then use these examples in a bunch of different ways to update the modules of your program. The simplest thing you could think of is: if you have examples for every module that have proven to work in the past, that doesn't mean they're correct, but it means there's some hope they're useful, and the simplest thing you can do is try to stick them into the prompt as demonstrations, as examples that just say, "Hey, when I got this input and produced this behavior, it seemed to work pretty well in the past." Once you do that, you can start exploring, whether intelligently or not, which of these examples are actually useful when we plug this module into a program and want the program to work better on a bunch of other examples. Another thing you can do is try to induce instructions: if you have a lot of examples of a module, you have a better understanding of what the module should do, and of when it leads to good behavior and when it doesn't, and you can give those to a language model and ask it, essentially, "What is the thing we're trying to ask for here?" A thing to keep in mind is that in all of these cases, language models are being used to build these components, but we're not assuming that language models are good at it. Language models are going to give you 20 different things, and 19 of them are going to suck, but because we can try so many, and we can explore that space with varying levels of intelligence, we can strike gold fairly often and find the combination of these pieces that leads to better prompts, in a way that is basically independent of your particular setup. So you can adjust your program or your language model choices, rerun the system, and have optimization happen again. And of course, once you've built these examples, if you have enough of them, you can fine-tune the model on them, or do reinforcement learning, depending on how you're sampling, or do preference fine-tuning
and other approaches there as well. There are various papers of ours that explore these types of strategies; the two I requested in the suggested readings were the one introducing MIPRO and the one introducing a strategy for combining prompt optimization and fine-tuning, and the rest of this talk will be about the MIPRO paper. Before we discuss MIPRO: we've now covered a sequence of abstractions and, at a high level, how a set of algorithms we call optimizers take programs written with these abstractions and give you compound AI systems that work well, but are expressed in a way that is more portable and, I think, a lot more elegant. What's cool here is that these things actually work pretty well in practice. A few months ago, University of Toronto researchers participated in the MEDIQA competition against 15 other schools and industry groups, and they won the competition, building question-answering systems for medical domains, by a 20-point margin over the next best system. One of the biggest differences between their approach and the others is that they used DSPy to express their system, and they used DSPy prompt optimizers, an early version of MIPRO in fact, to achieve this result; they won all three different settings of that competition.
A month later, folks at the University of Maryland, whose lead author also runs a prompting guide with, I think, tens of thousands of users or more, worked on a suicide-risk detection task that I believe he got from industry, and they wrote a really nice paper about it. He worked for 20 hours, and they documented all of the prompting strategies he explored for building this system for suicide-risk detection, which is basically a glorified classifier: all the approaches he tried, the prompts he explored. Then, in his words, he applied DSPy in 10 minutes, and it outperformed his best system by 40 to 50%. So they have a really nice paper about this from the University of Maryland. DSPy has also enabled a lot of state-of-the-art systems: from PATH, for training retrieval models, where we're optimizing prompts that synthesize the data used to train small models under the hood (I think there are a lot of interesting ideas there); to IReRa, a system for classifying with language models when you have something like 50 labels but 10,000 classes, which I had thought was an impossible task that just didn't make sense for language models, but Karel built it in DSPy and showed that it's possible; and then STORM, which generates Wikipedia articles from a topic you give it. So now, having seen some of these examples, let's look more closely at the paper "Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs," by co-first authors Krista Opsahl-Ong and Michael Ryan at Stanford; a lot of the slides I have next are borrowed from Krista and Michael's DSPy work. The problem setting, just to restate it, is that we are given a bunch of inputs, examples of a task, maybe questions, let's say, and we are given the language model program the developer built, of the kind we discussed earlier: just a function with a bunch of modules expressed in natural language.
Each of them might look like this: context and question, give me an answer. I don't know if the example is very useful at this point, but: "The Victorians is a documentary series written by an author born in what year?" Maybe you want to answer 1950 by looking at context you received in the earlier loop we have here. We're also asking for a metric. Some metrics, by nature, require labels; for example, if we know that the answer is 1950, we don't need labels for the whole pipeline, but knowing the final answer makes evaluation easy. Other times, maybe the metric is: "I don't actually know what the answer is, but I want answers that are grounded in the context that got retrieved, because I trust the context, and if the answer is grounded in it and is relevant, that probably means it's correct." That's a different metric, one that does not need labels. So there's a very large space of options, but it's really important that you explicitly define what it is you're trying to maximize, and once you've done that, you don't do it as part of your prompts; you do it in a way that's independent of your program, so that you can explore along both axes pretty modularly. That's something important to keep in mind, and the abstraction really forces you to do it.
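As a sketch, the two kinds of metrics just mentioned might look like this in DSPy-style Python. Both follow DSPy's metric calling convention; the groundedness check is a deliberately crude illustration (real versions typically use an LM judge), and it assumes the program returns the retrieved context alongside the answer.

```python
# Label-based metric: we know the gold answer (e.g., "1950").
def answer_exact_match(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# Label-free metric: reward answers grounded in the retrieved context.
def answer_is_grounded(example, prediction, trace=None):
    context_text = " ".join(prediction.context).lower()
    return prediction.answer.strip().lower() in context_text
```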
OK, so the goal is to give us an optimized language model program Φ′. In the optimizers we look at next, we're going to keep the weights frozen; in fact, we'll assume we don't have access to them. Instead, we are going to optimize the instructions, the descriptions of the task that go to the model, as well as the demonstrations, which are the few-shot examples in the prompt that just show the model: here are some inputs, and here is the corresponding behavior we saw in the outputs that seemed to work well in the past. Some assumptions we're making: we don't want to assume any access to log probabilities or model weights, because you want to iterate fast and work in natural-language space with high-level APIs. We have a lot of work where we relax this assumption and look into fine-tuning, and actually we have work showing you really need both to get the best performance in many cases, but if you had to pick one, prompt optimization tends to be more powerful. We also assume no intermediate metrics or labels, as we said earlier. And we want to be very budget-conscious, so we want to minimize two things. We don't want to ask you for a lot of example inputs for your task, because it's hard to create inputs, although technically inputs are the easiest thing to scale: you can build a demo, put it in front of your friends, and then you have a lot of inputs when they ask questions, if your privacy policy
allows you to do that. But we don't want to ask for a lot of inputs, so maybe 50 or a couple hundred, not tens of thousands, for example. And we don't want to call the language model too many times, because that takes time, is expensive, etc. Those are the assumptions we make, or the constraints we have, on this problem. Now, there are two types of problems we want to tackle in general. The first is: what even is a prompt? It's this long combinatorial string, and anything really goes, it's all valid, so how do you even explore the space, especially when you don't have gradients and you don't want to call the model too many times? It seems basically hopeless. The second problem, which I think is an even worse one when you combine the two, is: if you have so many modules, and you're changing things all over the place, how do you know what is leading to improvement and what's hurting? These pieces interact: you might improve the prompt in one part locally, but it actually hurts overall, because another part made an assumption about the output of the first part, and now you have to account for these kinds of blame-assignment issues. I'll cover three methods. This is not exhaustive by any stretch of the imagination, but these three are really good, quite different, methods of tackling this problem.
The simplest thing you can do, in my opinion, at least for the value it gives you, is to bootstrap, or self-generate, few-shot examples by running a dumb version of your program to build examples, then plugging them in and searching over that space. The cool thing is you can iterate: you can take the better program and build even better examples, build ensembles of them, use larger models to build examples; it's a very large compositional space. But in its simplest form: you have the program, you take a training input, and you execute the Python. Some of it is just normal Python, return statements or loops, or calls to a code executor or whatever, but some of it is modules, so those are special: we're tracking them and tracing their inputs and outputs. These things are generating search queries and generating answers, and if the metric at the end ends up liking the answer and tells me it's good, well, that whole trajectory of inputs and outputs seems interesting enough to keep around. This can now serve as a set of demonstrations for all our modules, and we can simply ask: what if we take those and plug them into the modules of our program? Then we can do, in the simplest case, just a random search: take a bag, a subset, of these demonstrations, plug them into the respective three or four prompts, run that on a small validation set, maybe trying to be smart about how you do that evaluation, then look at the resulting score and try to maximize it. So that's, I think, the simplest successful thing you could try.
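Here's a minimal sketch of that random search, in the spirit of DSPy's BootstrapFewShotWithRandomSearch. The `with_demos` helper is hypothetical shorthand for plugging a demo subset into the program's prompts; `candidate_demos`, `valset`, and `metric` are assumptions.

```python
import random

def random_search(program, candidate_demos, valset, metric, trials=16, k=4):
    best_score, best_program = -1.0, program
    for _ in range(trials):
        # Take a random bag of bootstrapped demonstrations...
        subset = random.sample(candidate_demos, k=min(k, len(candidate_demos)))
        candidate = program.with_demos(subset)  # hypothetical: plug demos into the prompts
        # ...and score the resulting program on a small validation set.
        score = sum(metric(ex, candidate(question=ex.question)) for ex in valset) / len(valset)
        if score > best_score:
            best_score, best_program = score, candidate
    return best_program
```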
A different approach is to look into the prompt-optimization literature, not for language model programs but for single language model calls. Those folks have a lot of rich concurrent research, but under much stronger constraints. In the prompt-optimization literature, which is of course morphing closer to this work now, though they started pretty separately, you assume that you have labels, so you have inputs and outputs, and it's one prompt; that's the whole system, with no program involved. And you assume there's this little part, a one-liner somewhere, that you want to search over. The space of exploration is very large; people do all sorts of things, from gradient-based exploration to various forms of reinforcement learning and all types of stuff. But one approach that's really cool and really high-level is called Optimization by PROmpting (OPRO), from DeepMind. The way that works is basically that they want to plug a little prefix into the end of the prompt to get the model on the right track. They might start with "think step by step," but they want to end up, at the end, with something like "take a deep breath and think step by step," and they show that this can actually help some models do a lot better. It's a much smaller scope, but it's an interesting space to explore, and we wanted to see if we could take this and apply it to language model programs, with the weaker assumptions we have. The way OPRO works for a single prompt is that they go to a model and say, "Here is the original instruction; give me 10 variants, or 100 variants," then evaluate all of them on a little evaluation set and look at which ones work best. Maybe the top 10 go back to the model: "Here are the top 10 and how well they performed; give me more." The idea is that the model might be smart enough to say, "I see some patterns between the successful and the unsuccessful instructions," and do, essentially, a form of mutation of these instructions, which can then be tried again, repeating the process. That's the OPRO approach. So how could you take OPRO and naively apply it to a language model program? The simplest thing, I think, is coordinate ascent. What that means is, and we won't spend too much time here, I don't know if the figure is big enough, but say you have two modules:
the simplest thing you can do is go to the model, generate a lot of proposals, and plug them in one at a time, keeping the other modules fixed at their initial instructions; evaluate all ten candidates, see which one worked best, maybe freeze it, and repeat for the next stage, looping through this optimization process. The problem is that this is horrendously expensive, because every time you're optimizing one module, you're running the whole program while freezing the other modules, and you're making this big greedy assumption that the best choice at any given stage will remain useful once you freeze it and try to optimize the rest of the parts. You're not co-optimizing these pieces; it's very expensive, and it actually doesn't even work that well.
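For concreteness, here's a hedged sketch of that greedy loop; `propose_instructions`, `evaluate`, and `with_instruction` are all illustrative stand-ins, not real DSPy APIs.

```python
def coordinate_ascent(program, module_names, propose_instructions, evaluate, rounds=2):
    """Greedy, per-module instruction search: expensive and not co-optimized."""
    for _ in range(rounds):
        for name in module_names:  # optimize one module at a time, others frozen
            candidates = propose_instructions(program, name, n=10)
            scores = [evaluate(program.with_instruction(name, c)) for c in candidates]
            best = candidates[scores.index(max(scores))]
            program = program.with_instruction(name, best)  # freeze the winner, move on
    return program
```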
A different approach would be to give up an assumption, or rather to make one: you could give up on explicit credit assignment. You don't really care to fix all variables and change one at a time; you could just ask the model to generate prompts for every stage all at once and hope that that process converges somewhere nice. Or you could go to the model and say, and it's a bit of a harder ask, "Here is the sequence of all the prompts of all the modules; I want you to propose new prompts for all of them," and try to do it jointly. That of course has the risk of potentially confusing the model; it's a much harder task, but with an oracle language model it's strictly more powerful, by far. Now, if you want this whole thing to work well in a language model program, it's really important that you're not myopic about optimization. You don't just look at a signature that says "generate a search query": the user didn't even say "search query," the user just said "query." Is it a SQL query? A search query for Google? For a retrieval model you host offline? These things are different, and you're at the mercy of the quality of the proposals you're getting, so you might as well try to make sure they're contextualized appropriately with respect to the program you have. That's the notion that you want the proposals to be grounded. In a traditional prompt-optimization setting, there is the history of the prior steps: how we got here, what things we tried, how well they performed; and then there is maybe a training set of explicit examples that you got from the
user, because the whole system has no missing pieces; it's just input and output. We have this intuition that, to take this and generalize it to language model programs, it would be very useful to provide a lot of contextual information about the program setup we have. The first thing is: because we don't have these training input/output labels for every module in general, or for any module in many cases, we can start plugging in the bootstrapped demos we looked at earlier. They're basically as good, for all we know, and they seemed successful during bootstrapped few-shot, so maybe we can just plug them in as the examples we use for constructing these prompts. That by itself gives you a lot, because you can see what works and potentially what doesn't. Here's an example: a question for which you want to generate a search query, together with the reasoning the model generated while building the search query it produced, which eventually led to a successful trajectory on that question. The other thing is that you want to give the model an understanding of the whole task, so it would be useful if the model building your instructions could see a summary of the dataset we're playing with. Here's an example of that, where the model basically observes that the multi-hop dataset "consists of factual trivia-style questions across a wide range of topics," etc. You have a component that builds up this summary by, essentially, what you could think of as a map-reduce over the dataset with the language model. And then an interesting thing you might want to ground your system in is: could it actually see the whole pipeline, so it understands the role of every module in it?
This works by literally inspecting the code you have for a program and giving it to the language model, which understands the syntax and builds a natural-language representation of the program. It says things like: "We have a program that appears to be designed to answer complex questions by retrieving and processing information from multiple sources; in this case it's set up for two hops," etc., and "this module in the program is responsible for generating a search query," etc. You might also try to maximize the diversity of the proposals by sampling a random instruction tip from the standard literature and plugging it in for your proposer, things like "hey, don't be afraid to be creative." We can take these out of the loop: you don't have to write them; we write them once, and then we can reuse them for building these systems.
So we just discussed extending optimization by proposing with language models. We've now covered two general strategies for optimization: one is building examples and searching over them; the second is giving the model as much grounding as possible in the language model program and asking it to propose instructions that we iterate over by learning from the previous attempts. The last approach is going to be a little different: our goal is to optimize both the instructions and the few-shot examples, but in a way that deals with credit assignment efficiently. The intuition here is that language models are not yet particularly good at dealing with credit assignment themselves, but because we're working with all of these spaces of discrete proposals, a lot of instructions, or a lot of examples that we've built, we can borrow a lot of intuition from the literature on hyperparameter optimization. We can build what is called a surrogate model: a model that is optimized to predict the quality of any possible configuration of our system, and that we can use to sample proposals for the entire system that we can actually test.
to uh sample sort of uh proposals for the entire system that we can actually test so Meo uh the MEO Optimizer Works in three steps first thing is it bootst demonstrations of the task we've explained what that means um it builds candidate instructions um using a language model program inside the optimizer that has all these pieces we looked at earlier like the summarizer the language model program you know uh uh you know describer all these all these little pieces and um you know that handles this form of credit assignment here by relying on just a
Concretely, suppose we have a language model program with two modules. In every module there are essentially two bulky parameters: the instruction, the string describing the task, for which we get a basic version from the adapter we discussed earlier, and the list of input/output examples, which starts empty but can be learned. Because we've done the bootstrapping and the candidate proposal in a grounded way, we can start exploring this discrete space: which combination of instructions and example lists across the modules leads to the highest quality? Of course we can't try all of them, so we rely on a Bayesian optimizer to give us an acquisition function that lets us make good guesses about which combinations to try. Once we pick a combination, we plug it into a version of the program and evaluate it on a mini-batch of our validation set; if the set has, say, 200 examples, we might sample 30 of them and get a score. We then feed that score, say 75% or 50%, back to update the surrogate model: "this combination got me roughly 50%," roughly because the score depends on the random mini-batch sample. For future proposals we want an acquisition function with a property like expected improvement, steering us toward the most promising combinations, and there are many such choices. We repeat this process over many trials until we find a really good combination.
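Here is a self-contained sketch of that trial loop using Optuna's TPE sampler, which is the kind of surrogate this style of credit assignment relies on (DSPy's implementation builds on Optuna, though treat the details here as a sketch). The module names and candidate counts are made up, and `evaluate_on_minibatch` is a stand-in for compiling the chosen configuration and scoring it on a sampled mini-batch.

```python
import random
import optuna

N_CANDIDATES = 10                              # instructions / demo sets per module
MODULES = ["generate_query", "generate_answer"]

def evaluate_on_minibatch(config, k=30):
    # Stand-in: in the real loop you'd plug `config` into the program and
    # score it on k examples sampled from the validation set.
    return random.random()

def objective(trial):
    # One categorical choice of instruction and demo set per module.
    config = {
        m: {
            "instruction": trial.suggest_categorical(f"{m}_instr", list(range(N_CANDIDATES))),
            "demos": trial.suggest_categorical(f"{m}_demos", list(range(N_CANDIDATES))),
        }
        for m in MODULES
    }
    return evaluate_on_minibatch(config)

# The TPE sampler acts as the surrogate: it predicts which combinations look
# promising given the noisy mini-batch scores observed so far.
study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=40)
print(study.best_params)
```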
What's happening across trials is that the search gets smarter over time: the quality of proposed combinations tends to go up, as opposed to random search, where you're just trying combinations blindly. So how do these approaches compare? Earlier we looked at a slide with results from pretty powerful off-the-shelf optimizers, and at results from people writing prompts by hand. It's also instructive to benchmark these optimizers against each other. That's actually really hard, and it's a place where we need more contributions, because you want tasks that are representative, hard, and not already overfit by existing language models, but on which you can also build interesting and realistic compound AI systems to optimize and explore. One attempt at this is LangProBe, a language model program benchmark with tasks like multihop retrieval, classification, and inference; some tasks have one module, some two, some four. Here are some initial test-set results on a bunch of these tasks, looking first at optimizing instructions only: module-level OPRO, which we discussed before, both without grounding and with grounding, and 0-shot MIPRO, which isn't allowed to use examples and only optimizes instructions. The first thing that jumps out is that, on average, optimizing instructions can help quite a bit compared to a basic adapter.
You can gain a few points here, a bunch of points there, a nice bump somewhere else; sometimes it overfits to the training set when that set is really small and actually gets worse. In general, though, optimizing instructions alone really does struggle on a lot of these tasks, and it's not obvious which instruction-only method wins. If you look at optimizing with demonstrations, the picture is quite different: you can already get ten-point jumps, several-point jumps elsewhere, with some overfitting here and there, and demonstrations turn out to be quite powerful, sometimes yielding very large gains. The interesting thing is what you see under the hood of this random search over the bootstrapped demonstrations the system generated: it's wild how much they vary. Some sets are worse than the initial zero-shot approach, while some on the frontier are way better, even though all of them were generated the same way, by running the program and doing rejection sampling. It's striking how big a difference they can make when you plug them into the program. Finally, you can take MIPRO, which optimizes few-shot examples and instructions together, and get the best results most of the time, with some nice jumps across these tasks. One interesting pattern: optimizing examples tends to be the winner overall, but you can spot cases where instruction optimization leads to more visible improvements. These are basically tasks with a conditional pattern, where seeing one or two examples doesn't teach you the whole rule, because there are more cases than a handful of examples can cover, or because the precise threshold or region in which a rule applies isn't clear from examples by themselves. A small illustration follows.
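To illustrate, here is a toy hedged example: the rule below has a precise threshold that no small set of demonstrations pins down, so it lives naturally in the instruction. The task and field names are invented, and this assumes a recent DSPy release where signatures can be built from a string plus instructions.

```python
import dspy

# A conditional rule with a precise threshold: a handful of demonstrations
# can't reliably teach "more than 100 users", but an instruction states it directly.
TriagePriority = dspy.Signature(
    "ticket -> priority",
    "Label the ticket 'urgent' only if it reports an outage affecting "
    "more than 100 users; otherwise label it 'normal'.",
)

classify = dspy.Predict(TriagePriority)
```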
To conclude, and to open the floor for questions more explicitly since they're open anyway, here are some lessons on what you might call natural language programming, which is what this set of abstractions in DSPy allows us to do. A big lesson from compound AI systems, not to be forgotten, is that programs can be a lot more accurate, more controllable, and more transparent than using deep neural networks alone, models alone. And you just need a declarative program: you don't have to sit down and hand-write a 10,000-token sequence of five prompts. You can really write ten lines of code, express an objective, pick an optimizer, and run the thing on your favorite language model. The high-level optimizers can bootstrap examples for prompts or propose instructions and explore that space; they're pretty bad at any individual proposal on average, but a large-scale search over those proposals does actually find things that, in many cases, outperform what you can do by hand. The real power is that, freed from the choice of how to tune each module, you can explore the space of compound AI systems more directly, where a lot of the value comes from iterating on the modules you're building, how they're connected, and what exactly the right objective to maximize is, which in many cases isn't obvious. A minimal sketch of what such a program looks like follows.
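Here is roughly what those ten lines look like: a hedged sketch of a two-hop program, assuming a recent DSPy release. The retriever stub, the metric, the model name, and the tiny trainset are all placeholders; a real run needs your own retriever and a few dozen training examples.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any LM

def search(query, k=3):
    # Placeholder retriever; swap in a real search over your corpus.
    return [f"(stub passage for: {query})"]

class MultiHopQA(dspy.Module):
    def __init__(self, num_hops=2):
        super().__init__()
        self.num_hops = num_hops
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):  # one query per hop, accumulating context
            query = self.generate_query(context=context, question=question).search_query
            context += search(query)
        return self.generate_answer(context=context, question=question)

# Express the objective as a metric, pick an optimizer, and compile.
def metric(example, prediction, trace=None):
    return example.answer.lower() in prediction.answer.lower()

trainset = [dspy.Example(question="What year was the Eiffel Tower completed?",
                         answer="1889").with_inputs("question")]

optimizer = dspy.MIPROv2(metric=metric)
compiled = optimizer.compile(MultiHopQA(), trainset=trainset)
```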
DSPy makes this kind of natural language programming approachable by yanking out hand-written prompts and giving you the notion of signatures; by throwing away prompting techniques and inference-time strategies, which are really fuzzy (what is a chain of thought, really?), in favor of predictors you can compose as modules, which take your signatures and apply the strategy on top in a metaprogramming sense; and by throwing away manual prompt engineering in favor of optimized programs, where the instructions or the weights of a given strategy are tuned for you. This is being widely used in production and in open source. Everything is at dspy.ai, including links to all the papers I discussed or didn't discuss. It's used at JetBlue, Databricks, Walmart, VMware, Replit, Haize Labs, Sephora, Moody's, and others, and you can find a nice list of public-facing use cases from people who are fine going on the record. That list is actually a great way to learn not just what people are doing with DSPy, but what kinds of compound AI systems are actually getting deployed, who's deploying them, and how they're optimizing them; many of these folks go on podcasts and describe how they're optimizing their systems for law firms or other domains. There are also great folks in the open-source community, cool collaborators who
make all of this possible, and you can pip install dspy and get started right now. To conclude, here are the key lessons on optimization in natural language. In many cases no single technique wins in isolation: building good examples of the task automatically, what we call bootstrapping, captures the notion of "show, don't tell," but on tasks with conditional rules that are scoped in hard-to-detect ways, optimizing instructions is powerful. And I should make it super clear that the biggest, coolest thing in DSPy is that we've isolated signatures from optimizers from adapters from metrics from inference-time strategies, and you can iterate on any of these four or five things independently and compose with everything else. For all the programs that exist right now, if you decide to build an RL-based prompt optimizer, or to introduce a new inference-time strategy as a predictor, everybody can change one line in their code and, if you did it well, see boosts across the entire set of use cases out there; a tiny illustration follows.
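In code, that one-line-swap point looks like this, reusing the hypothetical program, metric, and trainset from the earlier sketch. Both optimizer classes live in dspy.teleprompt, though the constructor arguments shown are a minimal subset.

```python
from dspy.teleprompt import BootstrapFewShot, MIPROv2

# Same program, same metric, same data; only the optimizer line changes.
optimizer = BootstrapFewShot(metric=metric)
# optimizer = MIPROv2(metric=metric)   # the one-line swap
compiled = optimizer.compile(MultiHopQA(), trainset=trainset)
```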
There are various tricks that make this work, which we've discussed here. All right, one last thing. I had this slide, and I like to use it, but right before the lecture someone told me that many of the other lectures, and maybe I'm proudly counted among them, come from large closed labs that don't publish, that are certainly not academic anymore, and that seem to be leading all the progress. So I really want to say that a big goal of DSPy, a meta-goal, is to enable open research to again lead AI progress. Open research has a lot of advantages; why we'd want that to be the case is outside the scope of this talk, but I'll tell you how DSPy makes a difference in this space: it's by showing that progress can come through modularity. We're not asking people to figure out how to fund billion-dollar training runs in isolation, or to invent ad hoc tricks that stop being relevant two weeks later when the model changes. Instead, we've outlined a space that I hope I've convinced you of: how do you scope out your programs well, how do you develop general inference-time strategies that act as predictors anyone can apply to their signatures, and how do you devise new optimizers that apply to any of these programs and give us systems stronger than the sum of their parts? Through that, I hope a lot of open research on optimizers, predictors, modules, and more can lead to the type of progress we saw with neural networks, where different people developed attention, Transformers, convolutions, and other components in a way that was highly distributed and obviously incredibly successful, as opposed to the way we currently iterate on large language models in a closed fashion.