Let's get started, everyone. It's my pleasure to continue our guest series. Our speaker today is Yann, a final-year PhD candidate at Stanford University in computer science, advised by Percy Liang. His research focuses on improving the effectiveness of AI when labels are scarce, and more recently he has been part of the Alpaca team, working on training and evaluating language models more efficiently using other LLMs, which is what his talk is about. Welcome, Yann, I'll hand it over to you.

Great, well, thanks for having me. One thing before we get started: if you have questions, maybe wait until the end, simply because there are a lot of people, so I think it will be easier to take questions at the end. But if something is really unclear, or if my mic stops working again, do tell me, because I don't actually see any of the messages. Great, so let's get started. I'll be talking about scalable evaluation of large language models.
As many of you know, and you all know about LLMs given that you're in this class and LLMs are all over the news, these are some of the applications you might have interacted with. What I'll be talking about is how you evaluate their performance. First I'll give you a quick overview of evaluation of LLMs, and of evaluation in AI more generally.

So why is evaluation important, and what is evaluation? When you develop an AI pipeline or an LLM, you usually have a goal in mind, and evaluation is simply quantifying the progress you're making toward that goal. Why is that useful? I'll give a few examples. One thing you can do with evaluation is identify improvements you could make in your model development pipeline, so you can understand how to design models better. Another thing evaluation is useful for: if you have ever used open-source LLMs, you will see that there are hundreds if not thousands of LLMs out there, so evaluation lets you select the best model for your own use case. Another reason evaluation is important is that even if you have already selected the best possible model, you might not know whether that model, or your entire AI application or system, is good enough to be put in production. Evaluation, combined with some threshold, can make you confident that your model or AI pipeline is ready for production. So those are some of the things evaluation is useful for.
Great. There are a lot of desired properties we would want our evaluation pipelines to have; I'll briefly mention a few of them, and this is definitely not exhaustive. One is scalability, which is literally in the title of my talk. Scalability is really important because we usually call these evaluation pipelines many times (I'll show some examples later), and if you call the pipeline many times, for example when developing a model, you really don't want to be waiting too long, or spending too much money, on every call you make to it. Another is relevance: of course you don't want to evaluate things that are completely irrelevant to your application or use case. Then there's something like discriminative power. What I mean by this is that you want your evaluation pipeline to be able to discriminate between models in a very fine-grained way. Usually that means you want the pipeline to be hard enough to, for example, distinguish between a Llama 70B and a Llama 7B; if it can't, it's not that useful. But if it's too hard and all the models perform badly, basically at random, it's not useful either. So there's a fine balance between the simplicity and the complexity of the pipeline, which depends on the current models, so that it can discriminate between them. Another property is interpretability: if your evaluation pipeline is very interpretable, you can act on it, which is very useful for model development, to know what you should be changing. Reproducibility is really important, for example in academia, where we keep citing the same numbers and you definitely want those numbers to be reproducible. And in a very broad way, you want your evaluations not to be biased; I'll talk about that in more detail later. The actual desiderata, the actual desired properties, really depend on what you're going to use your benchmark for.
One thing I really want to emphasize here, as I mentioned briefly before, is that there are a few different use cases for evaluation pipelines, and for different uses you will want different properties. There is definitely no single evaluation pipeline that will be perfect for every use case. Every time you build your own evaluation, or choose which evaluation to use, you should be thinking: what properties do I want in my evaluation, and why am I actually evaluating my models?

For example, if your goal is development, that is, knowing how to improve your model, then scalability is really important. When you do development (I'm sure you have some experience with hyperparameter tuning and training machine learning models), you need to call your benchmark or evaluation pipeline every time you try a new hyperparameter, and that gets very expensive if the evaluation pipeline is expensive. Another thing, which I already mentioned, is discriminative power: when you do model development and change your learning rate from, say, 1e-4 to 1e-5, it's going to make a very small difference, and you definitely want your evaluation pipeline to show that difference. If it's for academic benchmarking, which I'll talk about later, I already said reproducibility is crucial. Another important property there is robustness to gaming: when you release an academic benchmark, many people will try to publish papers and optimize their models for your benchmark, and you don't want them to be able to game it too easily. If the goal is to develop an application, and maybe put it in production, then interpretability is really important, because you want to know what is going wrong, and you need to trust your pipeline: if you're actually making a decision about whether to put something in production, you need your evaluation pipeline to be trustworthy. And relevance, of course: if your application is, for example, coding, it's really important that you actually evaluate coding rather than, say, math.

Those are just a few desired properties. What I really want to emphasize is that there are many different properties and you will never find a single evaluation pipeline that has all of them. So whenever you develop, use, or evaluate your own LLM, you should be thinking about why you're actually evaluating and what you need; different benchmarks will be good for different properties, and I'll be talking about a few of those benchmarks.
So, the evaluation recipe. The recipe for building an evaluation pipeline is actually very simple at a high level. First, you need to select an evaluation dataset. You can think of it as a set of questions for the LLM, plus maybe some additional information; some benchmarks will require, for example, the gold answer. That's the dataset. The other thing you really need is the evaluation metric: for every example, you need a way to quantify how good the answer of your LLM, or of any AI pipeline, is on that single example. Once you have these two, what happens is that you take every example, you quantify the performance of your AI model's output on that example, and then you aggregate over the entire dataset; typically the aggregation is just averaging.

Now, classic AI evaluation. I want to set the stage a little bit, and then I'll talk about why LLM evaluation is actually a bit challenging. In classic AI evaluation you usually have a very clear task in mind. For example, if you're building a dog classifier, the task is classifying whether there is a dog in an image or not. Here it's very easy to decide what data to use for evaluation: basically, data that is very similar to what your AI model will encounter when deployed in the real world. You try to find i.i.d. data from the test distribution you will actually encounter, which simply means you want images of the things you want to classify as dog or no dog.
In terms of metrics, the tasks in classic AI evaluation are usually what we call close-ended. Close-ended means there is a limited number of possible answers; in classification, or at least in dog classification, there are really only two possible answers, dog or no dog. And the solutions are usually known and objective: in this particular case, is there a dog? Yes there is; it's a clear and objective solution. As a result, building metrics is very easy for a close-ended task. Your model predicts whether there is a dog in the image, you simply compare the prediction with the truth, you get a one if you were correct and a zero if not, and then you average this metric over your entire dataset to get an accuracy. In the example shown here, the accuracy happens to be zero. So the fact that you have very clear, close-ended tasks means it's very easy to build automated benchmarks, and automated also means scalable.
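To make that recipe concrete, here is a minimal sketch of the loop just described: an evaluation dataset, a per-example metric (the 0/1 exact-match loss from the dog-classifier example), and averaging as the aggregation. The names and the toy dataset are illustrative, not taken from any particular library.

```python
# Minimal sketch of the evaluation recipe: a dataset of (input, gold answer)
# pairs, a per-example metric, and averaging as the aggregation.
# `model` is any callable that maps an input to a prediction.

def exact_match(prediction, gold):
    # 0/1 loss for a close-ended task such as dog classification.
    return 1.0 if prediction == gold else 0.0

def evaluate(model, dataset, metric=exact_match):
    scores = [metric(model(x), y) for x, y in dataset]
    return sum(scores) / len(scores)  # typical aggregation: just the average

# Toy usage with a trivial "model" that always predicts "dog".
dataset = [("image_1", "dog"), ("image_2", "no_dog")]
always_dog = lambda x: "dog"
print(evaluate(always_dog, dataset))  # 0.5 accuracy
```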
So how is that different with LLMs? With LLMs, there is a very large set of tasks people will use them for. This is just an example from InstructGPT, which is basically the precursor of ChatGPT: you see that people use it for generation, open question answering, brainstorming, chat, rewriting, and really there are so many things people can ask these models. And not only that: the actual task, or the distribution of tasks, is usually not known when you develop the model. If I develop my own model and just release it, for example on Hugging Face, I have no idea what people will use it for. That makes things really complicated, because you don't know what people will use it for, and not only that, the tasks people use it for will usually depend on the quality of the model. If the model is really good at summarization, maybe they will use it more for summarization. This tight dependency makes it hard to build evaluation pipelines, because you don't know what the real task is.

The other thing is that these are open-ended tasks. Oh, I have a chat question; I think someone is asking about the volume, I'll continue. So, the other challenge is that the tasks are open-ended. What I mean is that instead of image classification, where there is a single possible label, the kind of question you ask ChatGPT (and I'm sure many of you have interacted with it) might have many different answers. For example, here is one I found online, where someone asks ChatGPT to write three Instagram story ideas for a plumbing company, relevant to blah blah blah, informative and educational but still fun. There is clearly no single right answer; there are many possible answers, and you cannot even enumerate them all. And there is a continuum of quality: it's not as if there is one right answer and one wrong answer, the possible answers lie on a continuum, and on top of that, people might disagree. I might think the first answer is better while someone else thinks the second one is better. That makes it very hard to evaluate LLMs and to build automated benchmarks. So I'll be talking about how we overcome this challenge of having a diverse set of tasks and, more importantly, open-ended tasks.
Great. As I said before, there are two things you have to think about when building an evaluation pipeline or benchmark: the instructions you use, and the metrics you use. In terms of instructions, I'm not going to give you any particularly clever idea here (maybe a little at the end). What people end up doing is brute-forcing it: they just try to collect as many representative instructions, or questions to an LLM, as possible, covering as many potential tasks as possible, so that the set is representative of what the LLM will be used for downstream. There is really nothing smart there, although I'll mention some recent ideas from the community later. So I'm going to focus mostly on the metrics.

As I said, the challenge is that the task is open-ended, so now I'll talk about a few ways to get around that challenge. The first one is to convert the open-ended task into a close-ended task, basically bringing it back closer to classical evaluation of AI models. For example, if you want to ask questions about supernovas, instead of asking for free-form text about supernovas, you can ask a multiple-choice question: hey, which of the following is true for type Ia supernovas, here are a few choices, say which one is true. That constrains the type of output the model can give, and now it is much easier to evaluate: since it's not free-form text and you know what the correct answer is, you can just check whether the model's answer matches it. In practice (I'll talk about this more later) you take the LLM and look at the likelihood it assigns to each of the candidate sentences, and you check for which of the choices the LLM gives the highest likelihood; in other words, which choice is the LLM most likely to generate? Then you compare that most likely choice with the correct answer. That is the idea of converting to a close-ended task. The benefit of doing it this way is that it's very scalable and you understand exactly what is going on. The downside is that you have changed the task: you went from open-ended, free-form text to something close-ended with only a few choices, so you are evaluating something a little different from how the models will actually be used in practice.
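To make the likelihood trick concrete, here is a rough sketch using the Hugging Face transformers library. "gpt2" is only a stand-in model, the question and choices are made up, and matching the prompt tokens purely by position is a simplification of what careful implementations do.

```python
# Score each multiple-choice option by the log-probability the LM assigns to it
# as a continuation of the question, then pick the most likely option.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_logprob(question: str, choice: str) -> float:
    prompt_ids = tok(question, return_tensors="pt").input_ids
    full_ids = tok(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # predicts token t+1
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Only sum over the tokens of the choice, not of the question
    # (assumes the question tokenizes to the same prefix, a simplification).
    n_prompt = prompt_ids.shape[1]
    return token_lp[:, n_prompt - 1:].sum().item()

question = "Which of the following is true for type Ia supernovas?"
choices = ["They always involve a white dwarf.", "They only occur in spiral galaxies."]
scores = [choice_logprob(question, c) for c in choices]
predicted = choices[scores.index(max(scores))]
print(predicted)  # compare against the gold answer for a 0/1 accuracy
```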
Great. A second option is to use what are called reference-based heuristics. By reference-based, I mean that for every question you collect a reference answer, a potential solution given by some human or expert. For example, here I don't even know what the question was, but the reference answer is "They walked to the grocery store," and the generated answer is "The woman went to the hardware store." What you can then do is ask how similar the answer generated by the LLM is to the answer collected from the human. There are many possible metrics for this. Classic ones, which you may have heard of, are BLEU and ROUGE: they check n-grams, that is, the presence of short sequences of consecutive words. For example, "to the" appears in both sentences, and you look at the recall or the precision, basically how many n-grams of one sentence also appear in the other. Another one is BERTScore, which uses a BERT model: think of it as embedding the entire sentence into a vector representation that retains semantic similarity, doing the same for the reference, and checking whether the two embeddings are similar. So a reference-based heuristic doesn't even look at the question; all it does is collect a potential answer for every question and compare the generated answer with that human answer. In some ways it tries to do the same thing as accuracy, but instead of a hard metric like the 0/1 loss, where you get a one if the answer is completely correct and a zero otherwise, it uses a smoother version by looking at the similarity between the two sentences.
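Here is a toy illustration of the n-gram idea on the two sentences above: a ROUGE-flavored bigram overlap written from scratch, not the official BLEU or ROUGE implementation (in practice people typically use packages such as rouge_score, sacrebleu, or bert_score).

```python
# Bigram precision/recall between a generated answer and a single human reference.

def bigrams(text):
    tokens = text.lower().split()
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def bigram_overlap(generated: str, reference: str):
    gen, ref = bigrams(generated), bigrams(reference)
    overlap = sum(1 for g in gen if g in ref)
    precision = overlap / max(len(gen), 1)
    recall = overlap / max(len(ref), 1)
    return precision, recall

reference = "They walked to the grocery store."
generated = "The woman went to the hardware store."
print(bigram_overlap(generated, reference))  # only "to the" overlaps, so both scores are low
```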
The benefit of reference-based metrics is that you didn't change the task; you still have free-form generations. The downside is that there are many potential solutions. This approach would work well if the reference answer were the only acceptable answer, but in reality, going back to the Instagram-stories example I showed you, there are many potential solutions, and there might be a very good one that is very different from the one the human wrote. As a result, these metrics might wrongly tell you that your answer is completely wrong when it's actually a great answer. Think of it like writing a book: maybe the human wrote one kind of book and the LLM wrote a different kind of book, so the similarity between the two books will be very low even though the LLM's book might be great.

Another way of dealing with the problem is to simply ask humans to evaluate the outputs of the models: humans look at the questions and the answers and just say how good the answers are. The benefit is that this is really the desired evaluation, in the sense that you didn't change the task at all. The downside is that it's really not scalable, because now you need humans in the loop who read every answer given by your model. The other downside is that it's not very reproducible, because humans have very high variance, so you will get different results every time you use a different human annotator.
And finally, the approach I'll talk about a lot in this talk: you can try to make human evaluation more scalable by asking LLMs, instead of humans, to evaluate the answers. For example, you can ask ChatGPT: here is the question that was given to, say, some Llama 7B, here is its answer, how good is this answer for this question? The benefit is that we still didn't change the task, and now it's very scalable. The downside is that, at least currently, some people lack trust in models evaluating other models. You also don't have much interpretability: you don't really know what the LLM is evaluating. And (oh, someone is drawing on the screen) it also requires an oracle LLM: you need a more powerful LLM to do the evaluation, which means that, at least in the way I've described it, you cannot really use this type of evaluation to evaluate the state-of-the-art model. Okay, I still see the drawing and I don't know how to remove it... oh, thanks.

Great. Now that we've seen this overview of LLM evaluation, I'll talk a little bit about some of the benchmarks that are open and that people use in the open-source and academic communities. In the academic and open-source community, I would say benchmarks are mostly used for evaluating a broad set of capabilities.
For example, think of it as having the different open models, plus maybe ChatGPT, GPT-4, and some Claude models, and you want to know which is the best possible model at this point; you really want a broad evaluation of model quality. As for use cases: as I said before, it might be used for development, so when you develop your own model you want to know how well you're performing on a broad set of tasks. Model selection: if you build a new application and you really don't know which model to start with, because there are so many choices, you might just take the one that is best on the current benchmarks. And a lot of companies also use these benchmarks for PR: they say, hey, you should use our models because they are at the top of this benchmark.

So I'll talk about a few ways people do evaluation in the open-source and academic communities. One is simply looking at perplexity; hopefully you all know what perplexity is, but if not, I'll give a brief overview. The second option is what I call the close-ended kitchen sink: basically, take all the close-ended tasks you might perform and evaluate your model on all of them, the thinking being that if you cover that many close-ended tasks and your model is good at all of them, it's probably good at your own task too.
Another option is to have humans chat with the models, which is the human evaluation I talked about before, and the last one is LLM-based evaluation, which is what I'll talk about the most.

Great, so perplexity. One thing that is definitely used in the academic community, and actually in companies too (people just don't really show the numbers), is looking at perplexity, which is essentially the pre-training validation loss. I don't know how much you've already seen in this class, but basically these LLMs are trained to maximize the likelihood of the training text, and the loss you use is the cross-entropy loss, which is simply the negative log-likelihood of generating the next token given the previous tokens. Perplexity is just a more interpretable version of this validation loss: instead of taking the total loss, you take the average of the loss over the entire sentence, essentially to make it independent, or somewhat independent, of the length of the generated sequence, and then you exponentiate, so two to the power of the average loss, which makes the quantity independent of the base of the logarithm you used. Perplexity is essentially between one and the vocabulary size (it can actually go higher, but for most models it stays in that range), and the intuition is that it tells you the number of tokens your LLM is hesitating between. If your model is genuinely unsure between two words, you will get a perplexity of two.
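Here is a tiny sketch of that definition, assuming you already have per-token log-probabilities from your model; base 2 is used to match the "two to the power of the average loss" phrasing, and any base works as long as the log and the exponentiation agree.

```python
def perplexity(token_logprobs_base2):
    """token_logprobs_base2[t] = log2 P(token_t | previous tokens)."""
    avg_nll = -sum(token_logprobs_base2) / len(token_logprobs_base2)
    return 2 ** avg_nll

# A model that is perfectly torn between 2 equally likely tokens at every step
# assigns log2-probability -1 to each token, so its perplexity is exactly 2.
print(perplexity([-1.0, -1.0, -1.0]))  # 2.0
```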
Here is the test perplexity of different models between 2017 and 2023, on Wikipedia I believe. I just wanted to give you a sense of the numbers: between 2017 and 2023, models went from a perplexity of around 70 to less than 10. That means that every time current models generate a Wikipedia word, they are essentially hesitating between about 10 words; they are pretty confident, down to roughly 10 candidate tokens for every new token they generate. I see a question: does it take into account the embeddings of the words? I don't quite know what you mean by the embedding of the word here, but it does depend on the tokenizer; I'll mention that briefly in a second.

Perplexity is not used that much for academic benchmarking, but it's really common for model development. Especially when you do pre-training, the loss you're trying to minimize basically is the validation perplexity, so it's very common internally, but people don't show these numbers anymore. The reason they don't show them is that perplexity really depends on what data you're training on: if your validation set, the set you use to compute perplexity, is in your training set, you will get a very low perplexity. That's one problem. The other problem, as the question alluded to, is that perplexity depends on the tokenizer you're using, which means that when different models use different tokenizers, as is the case, you really can't compare across models. So it's mostly useful for model development, where the tokenizer is fixed.

Why is perplexity still useful? Basically, perplexity is extremely highly correlated with downstream performance: in general, the lower the perplexity, the better you will do on most downstream tasks. It's actually very hard to find datasets or benchmarks on which better perplexity leads to worse results. But as I said, it depends on the data and the tokenizer, which is why people mostly use it internally and don't show the numbers externally. So the benefit of perplexity is that it's very simple to compute and you know exactly what you're computing. The downside is that it looks at a different task, just generating the next token, which is different from how we actually use these models, namely free-form generation; and, as I just said, you can't compare different models, because it depends on what data they were pre-trained on and what tokenizer they use. Yes, as mentioned in the chat, different tokenizers will have different vocabulary sizes. And yes, I will send the slides later.
Great, so the close-ended kitchen sink. I just talked about perplexity; another way people evaluate these LLMs is to look at all the close-ended tasks that are reasonably good proxies and easy to evaluate. I would say there are two main close-ended kitchen sinks, that is, aggregated benchmarks built from many different benchmarks (there are a few more, but two main ones): one is HELM and the other is the Hugging Face Open LLM Leaderboard. What they do is take many different benchmarks that were released in the open source, look at the average result over all of them, and make choices about how to average and which benchmarks to include or exclude. I'll talk about the Open LLM Leaderboard, because it's the one used the most, at least in the open-source community. There's also a new version that came out, I believe, two months ago, so I'll talk about the new version and then briefly say why they had to change it.

I'll give a few examples of the datasets, or benchmarks, that are in this Open LLM Leaderboard v2. One of them is called MMLU-Pro, which tries to evaluate the knowledge of your model. Here is one example from this benchmark.
The question says that the gypsy moth produces a natural attractant, and so on, and then asks what quantity of the attractant will diffuse to the orifice in the same amount of time. You have to do some simple math and have some knowledge of chemistry; there are ten options, and the correct answer happens to be option (I), out of ten. As we saw before, this is a multiple-choice question, which means it's very easy to evaluate: all you have to do is take the model, look at the likelihood of generating each of the choices, pick the most likely one, and compute the 0/1 accuracy between the choice selected by the model and the correct answer.

The reason it's called MMLU-Pro is that it's a new version of MMLU, a previous benchmark. MMLU is really the most common knowledge benchmark, I would say: it has many different, roughly undergraduate-level questions in chemistry, biology, and many other domains. The reason people had to move to MMLU-Pro is that newer models were doing too well on MMLU; it basically reached the level of human performance, and given that these benchmarks are usually a little noisy (there may be some wrong answers), that is about the best you can do on MMLU. So MMLU-Pro is a less noisy dataset that is harder. What kinds of questions are in MMLU-Pro? A lot of math, physics, chemistry, law, and some engineering; it really tries to cover as many domains as possible, and as I said, it is really about knowledge. So that's MMLU-Pro.
Another dataset in this Open LLM Leaderboard v2, also about measuring knowledge, is called GPQA. These questions are all written by experts and are supposed to be hard for non-experts even if they use Google: they try to write questions such that non-experts searching on Google for about 30 minutes still cannot find the answer. It covers three domains, biology, chemistry, and physics, and here is one example about quantum mechanics with four possible answers. I definitely do not know the answer just by looking at the question, and it's supposed to be the case that I wouldn't be able to find it in 30 minutes on the web even if I tried. So that's another benchmark; again, these are all benchmarks in the Open LLM v2 leaderboard.

Another one is called MuSR, which is about reasoning, long-context reasoning actually. This benchmark is automatically generated. I actually cut the example because it's around a thousand words per question, so it's very long. It's essentially about murder mysteries, some object-placement questions, and some team-allocation optimization. For the murder mystery, for example, there are a few paragraphs explaining the background of the murder, and then it asks who the most likely murderer is. The way it's generated is that there is a lot of information about the murder case; they automatically remove part of the information and then generate an entire text based on what remains.

Okay, another one is the MATH dataset, which is high-school-level math. Here is one example problem: the equation x^2 + 2x = i has two complex solutions; determine the product of their real parts.
For the solution, the only thing that actually matters is what's in the box: they simply compare the boxed value in the reference human solution with the boxed value in the LLM-generated solution. So that's high-school-level math.

Another one is called IFEval, for instruction-following evaluation, which I find a pretty interesting benchmark. It doesn't look at the content of the LLM's answers at all. Instead, the instructions they give the LLM contain some formatting rules, some formatting instructions. For example, take this prompt: write a casual summary of the US maternity leave policy with two sections and at least 25 sentences. How do they evaluate that? It's free-form generation, so the model writes a free-form answer, and as we said, that is hard to evaluate. So what they do is not evaluate the content at all: they just check whether the answer has at least 25 sentences and whether it has two sections. Basically they only check whether the formatting is correct, which I find very interesting because it sidesteps the problem of evaluating the content of the response; it only looks at the model's instruction-following ability with respect to formatting. If you combine this with something like MMLU-Pro, you hopefully get quite orthogonal benchmarks: this one is about following formatting instructions, and the other is about knowledge.
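Here is a rough sketch of that kind of check (not the official IFEval code, and with deliberately crude notions of "sentence" and "section"), just to show that the content of the answer is never inspected.

```python
import re

def follows_format(answer: str, min_sentences: int = 25, n_sections: int = 2) -> bool:
    # Crude sentence count: split on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    # Crude section count: assume sections are separated by a blank line.
    sections = [s for s in answer.split("\n\n") if s.strip()]
    return len(sentences) >= min_sentences and len(sections) >= n_sections

answer = "Section one. " + "Another sentence. " * 30 + "\n\nSection two. Some more text."
print(follows_format(answer))  # True, regardless of whether the summary is any good
```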
The last one in Open LLM v2, I believe, is BBH, for BIG-Bench Hard. BIG-Bench is a big benchmark (you don't need to know much about it), and BBH takes a challenging subset of 23 of its tasks. Here are some example questions; these are still multiple choice, and there is some few-shot prompting, but that's not that important. For example: which sentence has the correct adjective order, choosing between two options; or which statement is sarcastic, choosing between two statements that differ by one word.

So that's everything in Open LLM Leaderboard v2, I believe. By now you get a sense that all these benchmarks are artificially easy, in the sense that they are artificially easy to evaluate, and none of them actually evaluates the real thing you care about. But given that there are many types of benchmarks you're evaluating on, the hope is that you cover a lot of different aspects of what a good LLM should be good at, and as a result, if your LLM is good on this meta-benchmark, it should be pretty good for your own task too.

Let me read the chat. Oh, I don't see the messages... ah, here. There's a question about using the kitchen sink and transforming the questions into close-ended form. Yes, you definitely need to look at potential biases.
Actually, a lot of previously written benchmarks do have issues like this. It's not even bias exactly: maybe the model finds some spurious correlation, for example that the longer answer is usually the correct one. So you definitely need to analyze a lot of that. I'll come back to the problem that was just mentioned later.

Okay, the difference between v2 and v1, which changed a few months ago. As I said, they basically replaced some benchmarks with more challenging ones: you had MMLU and now you have MMLU-Pro; for math you had GSM8K and now you have MATH; and for another knowledge benchmark you had ARC and now you have GPQA. They made the benchmarks harder because the models were just becoming too good on v1. They also started using a different aggregation. Before, it was just the average over all these benchmarks; now they do something slightly smarter. For example, in the case of BBH there are only two choices, which means a random baseline would already get 50% accuracy, while for MMLU-Pro there are ten choices, so a random baseline would get only 10% accuracy, i.e., 90% error. So now they look at the difference between the random-baseline accuracy and the actual accuracy your model gets, and that renormalized metric is what they aggregate. In some sense, this weights raw improvements more heavily on the benchmarks with fewer choices.
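As a sketch, the renormalization described above looks roughly like this (the leaderboard's exact formula and rescaling may differ): map the random baseline to 0 and a perfect score to 1 before averaging across benchmarks.

```python
def normalize(accuracy: float, n_choices: int) -> float:
    random_baseline = 1.0 / n_choices
    return (accuracy - random_baseline) / (1.0 - random_baseline)

print(normalize(0.75, n_choices=2))   # BBH-style, 2 choices       -> 0.50
print(normalize(0.75, n_choices=10))  # MMLU-Pro-style, 10 choices -> ~0.72
```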
Okay, so to close out this part of the talk, the close-ended kitchen sink: the benefit is that it's very simple. As you saw, we can go through every benchmark and understand what's happening; none of them is great on its own, but hopefully together they cover a lot of the things we want our models to be good at. The downside is that it really doesn't evaluate the thing we care about in the first place, which is open-ended generation.

Great, so now I'll talk about human evaluation, and specifically about one very famous benchmark that is done with humans: Chatbot Arena. I'm sure many of you have heard about it. The idea is that you have humans (any of you can actually take part in this benchmark and act as one of the annotators) who interact with two models in a blinded way, meaning you don't know which models you're talking to. You ask your question to the two models and then say which of the two answers is better. In this way they collect a lot of votes, a lot of comparisons; I believe the screenshot says 300,000, but that's probably from early this year or even last year, so they've probably collected more than a million by now. So that's Chatbot Arena. Just to give you a sense, the kinds of questions people ask in Chatbot Arena are things like "what's the most popular item on the menu of a Subway in Taiwan" or "make a triggerbot in GTA 5". Of course, you get a fairly biased type of question, because it's only people who know enough about LLMs, or are interested enough, to come and interact with them; so it's a somewhat biased distribution, but it's very interesting.
Speaking of that distribution, here are the kinds of tasks, or instructions, that people send to these models: a lot of coding, a lot of technical and software-related questions, which makes sense given that it's people interested in LLMs, and some role playing. Okay, so that's Chatbot Arena.

A quick review of human evaluation via chat. The benefit is that it finally evaluates the open-ended task: it's really free-form generation, you ask any question, you see the answers, and you just say which one is better. One downside is that it's hard to scale. They were able to scale it because they essentially let anyone in the world interact with these models, but it still takes a very long time to get results, and usually only the best models, or at least the models from the famous companies, end up being evaluated on this benchmark. That means that if you're developing your own LLM, you will definitely not have 100,000 people interacting with it for every new learning rate you try, so you really cannot use this for model development. Another downside, as we said, is that you get a fairly biased distribution of the instructions people send to these models. But overall it's a great benchmark. Okay, so now I'll talk about LLM-based evaluation, which is probably what I'll spend most of the rest of the talk on.
The idea here is to take the best of both worlds. We saw human evaluation, which is not scalable but at least evaluates the thing you want, and we saw close-ended evaluation, the kitchen sink as I call it, which is scalable but doesn't really evaluate the thing you care about. The idea of LLM-based evaluation is to make human evaluation scalable by replacing the humans with another LLM. For this I'll talk about a paper we wrote nearly two years ago, about a year and a half ago, called AlpacaEval. It's one of the most commonly used benchmarks for LLM-based evaluation. As I said, the goal is scaling human evaluation.

Some background: at the time we were developing the model that ended up being called Alpaca. That was two years ago; it was an instruction-following model, so think of it as a replication of ChatGPT very early on. We wanted an open-source model that was somewhat similar to ChatGPT. What do you need for that? One, to train an instruction-following LLM you need some data; I'm not going to talk too much about this, but we collected the data using another LLM. Two, you need a base model; we used LLaMA 7B, which is already a long time ago. And three, you need a loss to optimize; we used a supervised loss, which is not that important here. The thing I want to highlight is that once you have data and a way to train your model on that data, you then need to tune your learning rate and all your other hyperparameters, so you need some development benchmark to know which hyperparameters to use. That's where our idea came from: use an LLM to tell us whether the generations we were getting from these models were good or not, and use that as a proxy evaluation benchmark to know whether we were making progress and which hyperparameters to pick. Once we did all of this, that's how we got the Alpaca 7B model. As I said, when you design a benchmark you need to think about two things: the metric and the instructions.
Let's talk about the metric first; it's actually very simple. You take a question, an instruction. You have two models; say the model you want to evaluate is the one in blue. You ask that model to generate an answer to the question, so this is open, free-form generation. You also ask a baseline, in our case text-davinci-003, an old OpenAI model, to generate an answer. Then you take these two answers and ask another model, GPT-4 in this case, which of the two answers is better. So we do exactly what Chatbot Arena does, but with an LLM instead of a human. The LLM tells you which one is better, you repeat this over the entire instruction set (which I'll talk about later), and what you get is what we call the win rate. It is the expected preference of the oracle LLM, GPT-4 here, between the model you're evaluating and the baseline: the expected probability of winning against the baseline. That's the metric we use, and it's very easy to compute and very scalable.
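Here is a minimal sketch of that win-rate computation, assuming the OpenAI Python client as one possible judge backend; the prompt is heavily simplified compared to the real AlpacaEval template, and the model name and fixed A/B ordering are illustrative choices only.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_pair(instruction: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "Which answer is better for the instruction below? Reply with 'A' or 'B' only.\n"
        f"Instruction: {instruction}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    )
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

def win_rate(instructions, model_outputs, baseline_outputs) -> float:
    # Fraction of instructions on which the judge prefers the model over the baseline.
    # Real implementations also randomize or control the A/B order to avoid position bias.
    wins = sum(
        judge_pair(inst, ours, base) == "A"
        for inst, ours, base in zip(instructions, model_outputs, baseline_outputs)
    )
    return wins / len(instructions)
```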
So, does it work? One thing you might wonder is: now that you're basically asking LLMs to do the evaluation, how do you know they're any good at the job? Here is a result from about two years ago, or a year and a half ago. On the x-axis you see the price per 1,000 examples; for humans we used Mechanical Turk workers, which came to around $300 per 1,000 examples. On the y-axis you see the human-mode agreement: whether a single human agrees with the mode of all the other humans, basically how much humans agree with each other. You can see that this number is quite low, around 66%. The reason is that if you look at two answers that are very similar, differing by only a few words, people really won't agree on which one is better. There's a lot of variance; people may even disagree with themselves if you ask them today and again the next day. The task is just very noisy, very hard, and very subjective. GPT-4 got around 64 or 65%, and again that was two years ago. We were able to improve these evaluators, and essentially we got LLM-based evaluators that are as good as humans, in the sense that they agree with humans as much as, or even more than, humans agree with each other, while being at least 30 times cheaper. And that was two years ago; right now the current evaluator on the AlpacaEval leaderboard sits around here, which means it's more than 100 times cheaper than humans and actually much better than humans at agreeing with other humans, which is pretty surprising, or could be surprising. What's nice is that it keeps improving: every time a new best model is released, by OpenAI, or a new Claude model, or anyone else, we can just use it, and this point will keep moving toward the top left. That's one of the reasons I'm really excited about LLM-based evaluation: given that these evaluators are already good now and will keep improving, it seems worth spending time on designing these kinds of benchmarks.

Great, so the instruction set. We have around 800 instructions; here are a few examples. The instructions in AlpacaEval are honestly pretty easy, pretty simple, because they were designed for the models of two years ago.
Here are some example questions: what if Alan Turing had not cracked the Enigma code during World War II, what would have happened? Is MFCC the same thing as a Mel spectrogram, or what is the difference? Or write a rap song about some topic, things like that.

How did we choose this instruction set? We basically aggregated different datasets that we found online, with two goals in mind. First, we wanted instructions that allow you to distinguish between models: as I said before, for model development it's really important to have this discriminative power, this ability to distinguish between models. Second, we wanted our instructions to be realistic, and of course there's a question of what realistic means. Given that we had released the Alpaca demo two years ago and around 300,000 people had chatted with our model, we had a lot of questions that real people asked, and that's essentially what we defined as realistic. I'll go over this only briefly, but the way we selected instructions that distinguish between models was by looking at p-values: test p-values that tell you whether one model is statistically better than another, and you basically want those p-values to be very low. The exact details are not that important; the point is that you should spend some time on the instructions and choose them carefully.
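One simple way to check whether an instruction set can actually separate two models (not necessarily the exact test we used) is a paired sign-flip permutation test on the per-instruction preferences; the preference data below is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
prefs = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])  # 1 = judge preferred model A, 0 = model B
observed = abs(prefs.mean() - 0.5)                 # deviation from "no difference"

# Under the null hypothesis the two models are interchangeable, so each
# preference is a fair coin flip; see how often chance alone produces a
# deviation at least as large as the observed one.
null = np.abs(rng.integers(0, 2, size=(10_000, len(prefs))).mean(axis=1) - 0.5)
p_value = (null >= observed).mean()
print(p_value)  # a low p-value means the instruction set separates the two models
```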
Great, so the AlpacaEval benchmark. At least when we released it, the correlation between AlpacaEval and Chatbot Arena, which we took as the silver standard for evaluation, was around 0.94 Spearman correlation, which is pretty high. That means the ranking of models given by humans, which is not very scalable, and the ranking given by this LLM agree quite closely, and the LLM version is much more scalable: it takes less than ten dollars and about three minutes to run the benchmark.
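Measuring that kind of rank agreement is a one-liner once you have the two sets of scores; the numbers below are made up and only the procedure is the point.

```python
from scipy.stats import spearmanr

alpaca_eval_scores = [92.1, 85.4, 77.0, 60.3, 44.8]    # hypothetical win rates
chatbot_arena_scores = [1250, 1210, 1180, 1100, 1020]  # hypothetical Arena ratings
rho, _ = spearmanr(alpaca_eval_scores, chatbot_arena_scores)
print(rho)  # 1.0 for this made-up data; the real benchmark reported roughly 0.94
```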
The benchmark also had a decent amount of community uptake: we already have around 200 models from the community that have been submitted to it. So, a summary of AlpacaEval, and more generally of LLM-based evaluation: we use an LLM to scale the evaluation of instruction-following LLMs. The benefit is that it's scalable, and if you spend some time doing things well, you can get high fidelity, in the sense that the LLM judges agree with humans.

That all seems great, but I want to talk about some challenges with this whole pipeline, challenges we had and some we still have. I'll talk about AlpacaEval length-controlled, which is the next version of AlpacaEval: what the goal was and what issue we initially faced. The issue was about decreasing spurious correlations, which we touched on briefly before. The background is as follows.
When we released AlpacaEval, the LLM judge actually preferred the longer of the two answers 74% of the time. As a reminder, the judge sees two answers and you ask it which of the two is better; 74% of the time it just preferred the longer one. That is not completely shocking, in the sense that humans also prefer longer answers: part of it is that if you show humans two possible answers and they don't want to spend too much time evaluating them, they will probably fall back on spurious cues and just say the longer one is most likely better. Anyway, we had around a 74% preference for longer outputs.

So after we released the benchmark (here you see time on the x-axis and, on the y-axis, the average output length of the top 10 models on our benchmark), the average length at release was around 1,200 characters, and over time, especially in 2024, we saw a huge bump. The best models on our benchmark, or at least the best models according to our benchmark, became longer and longer. This was very worrying, actually very concerning, because a lot of the models that were climbing our leaderboard honestly weren't better once you interacted with them. It seemed they were really over-optimizing for length: either directly or indirectly, people, or the models, had figured out that this benchmark has a spurious correlation with length, so when you do hyperparameter tuning against it, you end up with models that generate much longer outputs.
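A diagnostic like that 74% number is easy to compute once you have the raw pairwise judgments; here is a sketch on a made-up data structure, where each judgment is (answer A, answer B, preferred side).

```python
def longer_preferred_rate(judgments):
    hits, total = 0, 0
    for answer_a, answer_b, preferred in judgments:
        if len(answer_a) == len(answer_b):
            continue  # length ties are uninformative
        longer = "A" if len(answer_a) > len(answer_b) else "B"
        hits += preferred == longer
        total += 1
    return hits / total if total else float("nan")

judgments = [
    ("short answer", "a much, much longer answer", "B"),
    ("a long and detailed reply", "ok", "A"),
]
print(longer_preferred_rate(judgments))  # 1.0: the longer answer wins every time here
```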
So let me talk briefly about how we fixed that. We took a causal perspective, which means we asked ourselves: what would the metric be if the baseline output and the output of the model being evaluated had the same length? Another way to view this is with a causal graph. The preference given by the AlpacaEval judge depends on the model that generated the answer, on the instruction, and also on the length of the output, and the length of the output itself depends on the model. What we said is that instead of looking at the causal paths through all these nodes, the only thing we care about is the direct path between model and preference; we do not want to take into account the path that goes through the length of the output to the preference.

The way we did that is regression analysis, which is a very standard technique in statistics for dealing with this type of problem. Specifically, we modeled AlpacaEval's preferences as a function of those three nodes: the model, the length of the output, and the instruction. The exact form of the model is not important; we used a GLM, so just think of it as a logistic regression with some fancy features. Once we have this fitted model, a model that predicts AlpacaEval's preference based on which model you're evaluating, the output lengths, and the instruction, we can simply ask the GLM: what would the preference have been if the baseline output and the model output had the same length? All you have to do is set the two lengths equal, which makes the whole length term zero, and you end up predicting with a simplified GLM. That prediction is what we call the AlpacaEval length-controlled preference. So, a quick summary: we first model AlpacaEval's preferences based on the model, the output length, and the instruction, and then we essentially drop the length term to get the length-controlled preference, which is the predicted preference if the baseline and the model outputs had the same length.
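Here is a deliberately simplified sketch of that idea using scikit-learn, on made-up data. The real AlpacaEval length-controlled GLM also has per-model and per-instruction terms and a saturating, tanh-like length feature; this version keeps only a single model-vs-baseline term and a standardized length difference.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per-example data: y = 1 if the judge preferred the model over the baseline,
# len_diff = standardized (model output length - baseline output length).
y = np.array([1, 1, 0, 1, 0, 1, 1, 0])
len_diff = np.array([0.8, 1.2, -0.5, 0.3, -1.0, 0.9, 1.5, -0.2])

# Features: a constant "model skill" term plus the length-difference term.
X = np.column_stack([np.ones_like(len_diff), len_diff])
glm = LogisticRegression(fit_intercept=False).fit(X, y)

# Length-controlled win rate: predict the preference with the length term zeroed,
# i.e. "what would the judge have said if both outputs had the same length?"
X_controlled = np.column_stack([np.ones_like(len_diff), np.zeros_like(len_diff)])
lc_win_rate = glm.predict_proba(X_controlled)[:, 1].mean()
print(lc_win_rate)
```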
What are the benefits of doing it this way? I didn't talk much about the GLM itself, but it retains a lot of nice mathematical properties, so the result can still be interpreted as a win rate; most importantly, it is still between 0 and 100 and has the usual properties of a rate. Another benefit is that it's model-independent: every time you add a new model you have to fit a new GLM, but you don't need to recompute the length-controlled win rates of the other models against the baseline, which means that when someone submits a new model you're not changing the win rates of the existing models, something that would be pretty annoying in academia. And finally, it's easily extendable: right now we're only talking about the spurious correlation with length, but you can use a similar technique for many other spurious correlations, just by adding a new term to the GLM.
Great, so let's see whether all of this actually works. Here what I'm showing you is AlpacaEval: this is the win rate for different models, and on the right I'm showing what happens if I prompt the same model to provide more verbose answers, and what happens if I prompt it to provide more concise answers. What you see is that the win rate has a huge variance: for the same model, if I prompt it to be more verbose, it performs much better on AlpacaEval, at least the original version. When you use the length-controlled AlpacaEval, so these are the length-controlled win rates, you see there is very little variance: just by prompting the model to be verbose you only go from 50 to 51.6, instead of from 50 to 64.2. So it worked, in the sense that it's less biased. And somewhat surprisingly, or at least interestingly, even though this was not what we were optimizing for, by doing this length control we actually increased the Spearman correlation with Chatbot Arena. I told you before that AlpacaEval had around a 0.94 Spearman correlation with Chatbot Arena, which was among the highest along with MT-Bench; well, with this length control we now have a 0.98 Spearman correlation with Chatbot Arena, which means it is very, very highly correlated with Chatbot Arena. At least when we released it, and I believe it's still the case, this benchmark had the highest correlation with Chatbot Arena, which is the human ranking we saw before. Here you also see some other benchmarks and how highly correlated they are with Chatbot Arena.
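For reference, the correlation numbers quoted here are just rank correlations between a benchmark's scores and the Chatbot Arena ranking over a shared set of models. A sketch with made-up numbers:

```python
# Sketch: how the Spearman correlation between a benchmark and Chatbot Arena is computed.
# The scores below are made up; in practice you use the models shared by both leaderboards.
from scipy.stats import spearmanr

arena_elo       = {"model_a": 1250, "model_b": 1180, "model_c": 1105, "model_d": 1010}
benchmark_score = {"model_a": 62.1, "model_b": 55.4, "model_c": 48.0, "model_d": 50.2}

models = sorted(arena_elo)
rho, _ = spearmanr([arena_elo[m] for m in models],
                   [benchmark_score[m] for m in models])
print(f"Spearman correlation with Chatbot Arena: {rho:.2f}")
```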
Great, so retrospectively, did it work? Here I'm showing you the same graph as before, but with the additional part after length-controlled AlpacaEval was actually released. Just after it was released, the output length of the top models really decreased, and now it seems to have been increasing a little bit again. So I would say it seems to have worked, but maybe it needs an update; I don't really know whether the recent increase reflects truly better models or not, I would have to check again, but it definitely seems better than before.

Great, so to summarize: we basically used regression analysis to alleviate a spurious correlation in LLM-based evaluation. But the broader point I want to share is that you really need to be careful when using these LLMs. I'm all in for using LLMs as evaluators, but there can be issues, and you need to be on your toes and ready to alleviate them. The benefits here: the metric has less length bias, as we saw, and it actually ended up with a higher correlation with humans.

Great, so I talked about AlpacaEval and length-controlled AlpacaEval. Another benchmark based on LLM judges, probably just as common, is called MT-Bench. MT is for multi-turn, so it's basically similar to AlpacaEval
but multi-turn. AlpacaEval is single-turn, which means you ask one question to the model and after the answer it stops. MT-Bench also has good correlation, as we saw before: around 0.94 with Chatbot Arena, so slightly less than AlpacaEval, but still very high. The type of questions are things like, first turn: "compose an engaging travel blog post about a recent trip to Hawaii," and then the second turn might be: "now rewrite your previous response, starting every sentence with the letter A." So it's actually pretty cool that it covers multi-turn.

Great, so the pros of LLM-based evaluation: it evaluates the real thing you care about, free-form generation; it's very scalable; and not only that, it will improve over time. I think this is one of the key points: human evaluation won't really improve over time, while these models will keep improving, which means LLM-based evaluation will become more scalable and higher quality. The downside is that there are some
trust issues, as we saw: sometimes we don't fully understand what the judge is doing, or we might have missed some spurious correlation, so people always have some trust issues with LLM-based evaluation, and this is something we as researchers need to improve. Another thing is that you need an oracle LLM, which is pretty annoying because, as I said, it's hard to use these LLM-based evaluations for state-of-the-art LLMs, since they cannot use a better LLM for developing their own. And not only that, you also cannot use LLM-based evaluation on domains where LLMs are really bad, for example math. Also, as humans we kind of gave away our control: we're not really the ones deciding what is better and what is not anymore, we're relying a lot on the LLMs.

Okay, so based on these pros and cons, I want to mention one additional project I'm still working on. I want to give a kind of preview of results because I thought it
would fit well with where I think the community will be going, and where it should be going, which is how to scale evaluation in expert domains. As I said, these are preliminary results, so things might still change. As I just said, LLM-based evaluation is scalable but requires oracle LLMs, which is not good; it lacks control, because we basically give away control as humans; and it also lacks interpretability. When we look at these points, the reason we end up with these cons is that there are really two things in evaluation: there's deciding what the correct answer is, and then there's actually applying that evaluation. When we do LLM-based evaluation like AlpacaEval or MT-Bench, we basically merge these two things, and now the LLMs are doing both: they decide what the correct answer is, and they apply it. What I think we should be doing is separating the two: we should still have humans decide what a good answer is and how the evaluation should be applied, but then let LLMs be the ones actually applying it, because that is what makes it scalable.

To be slightly more concrete: usually when you build a benchmark, as I said before, you basically use humans for one thing, which is writing the instructions. What I'm suggesting is that when the humans, or experts, write the
instruction, they should also define how to evaluate answers to that instruction. One way to do that is by writing a very detailed analytic rubric. This is, for example, how it's done in education: at different education levels you might have these analytic rubrics that say, here are the four different axes you should be considering, and here's what it takes to get an excellent score on each of them. This is a one-time cost; it is basically part of writing the benchmark. And given that it's a one-time cost, you should still have humans write it: even though it's slow to write, much slower than writing instructions, it's still only a one-time cost, and here trust is key, because this is defining how we're going to evaluate our benchmark. Once you have this, you can have LLMs, conditioned on this evaluation strategy written by experts, apply the evaluation. The reason I think about it that way is that applying the evaluation is a recurrent cost, so there scalability is key. So what we essentially do is have humans, or experts, do everything that is a one-time cost, then apply it at scale with LLMs, and the human effort is amortized over all the times the benchmark gets applied. Here the LLM just decides which axis gets what score.

So, the benefits of this. Oh, maybe before talking about the benefits, I just want to mention
that there are actually many possible formats. I just talked about analytic rubrics, but there are other ways of writing these evaluation guides. One possible way is to write a checklist, just a list of questions from the expert saying, "LLM, you should consider all of these things in the answer." Another is to write a reference answer: the expert might write, "here's a potential answer, and here's what you should be looking for when evaluating an answer from the LLM." We saw the analytic rubric, which is this kind of matrix. And maybe you can also have a list of errors; I don't know how it works at UPenn, but at Stanford we use Gradescope, which has these lists of errors, and for every error a certain number of points gets deducted. That's how we grade assignments at Stanford, and that's another thing you could do: imagine I write an instruction and then also write, say, 30 potential errors and how many points to deduct for each mistake.
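To make this a bit more concrete, here is a rough sketch of what an instruction-specific rubric plus an LLM applying it could look like. The rubric fields and the `judge` callable are hypothetical illustrations, not the actual RubricEval interface.

```python
# Rough sketch of instruction-specific rubric grading (hypothetical format, not the real RubricEval code).
# The expert writes the rubric once (one-time cost); an LLM applies it to every model output (recurrent cost).
import json

rubric = {
    "instruction": "Derive the gradient of the logistic loss and discuss numerical stability.",
    "axes": {
        "correct_gradient":    {"weight": 4, "excellent": "Correct derivation with all steps shown."},
        "numerical_stability": {"weight": 3, "excellent": "Mentions log-sum-exp / sigmoid saturation."},
        "clarity":             {"weight": 3, "excellent": "Clear notation, defines all symbols."},
    },
}

def grade_with_rubric(output: str, rubric: dict, judge) -> float:
    """Ask the judge LLM to score each rubric axis, then aggregate into a 0-100 score."""
    prompt = (
        "You are grading an answer with the analytic rubric below. "
        "Return JSON mapping each axis name to a score in [0, 1].\n\n"
        f"Rubric:\n{json.dumps(rubric, indent=2)}\n\nAnswer:\n{output}"
    )
    axis_scores = json.loads(judge(prompt))  # `judge` is any LLM call that returns text
    total_weight = sum(a["weight"] for a in rubric["axes"].values())
    weighted = sum(rubric["axes"][k]["weight"] * axis_scores[k] for k in rubric["axes"])
    return 100 * weighted / total_weight
```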
So, preliminary results. This is on an expert machine-learning benchmark that I wrote myself, because it's basically the only thing I'm somewhat of an expert in. You see here the naive evaluation: this is the Spearman correlation with the ranking given by humans, and naive means the judge is not conditioned on anything, so this is basically the MT-Bench / AlpacaEval setup. The other four bars show what happens when you condition on the checklist, the reference solution, the rubric, and the list of errors, and what you see is that the Spearman correlation increases by a lot. Again, these are preliminary results and I'm still hoping to push them further, but you already see a huge bump, from less than 0.2 on this very hard machine-learning dataset to around 0.6 Spearman correlation. What's nice is that this gain is independent of which LLM you use as the evaluator: you see it for Claude 3 Opus applying the evaluation, for GPT-4, for GPT-4o mini, and so on.

Okay, great, so I'm done with academic and open benchmarks. I'll chat briefly about some challenges, and then I want to leave some time for questions. So, challenges with open and academic benchmarks. One that we haven't talked about is consistency. I told you before that multiple-choice questions are very easy to evaluate, but that's not completely true. For example, the
order: I think one question alluded to this before, but the order of the answer choices you give can really change the evaluation score you get. Depending on how you evaluate your model, this is not an issue if you use the log-likelihood way of evaluating that I told you about before, but if, for example, you just ask the model "is the correct answer A, B, C, or D?", then reordering A, B, C, and D might change the answer it gives. So consistency is an issue.

To give you a very concrete example of consistency challenges: we talked about MMLU, and MMLU actually has different implementations, in the sense that people use different prompts and different ways of evaluating the generations, and you end up with very different scores. Just to give you an idea: I told you MMLU is multiple-choice, and the LLM chooses the most likely answer. But what does it mean to choose the most likely answer? As I told you before, one thing you could do is look at the likelihood of all the candidate answers and pick the most likely one. Another thing you could do is ask the LLM, "is it A, B, C, or D, which one is the correct answer?". Another thing you could do is ask the model to generate the answer freely and see what it produces, but maybe what it ends up generating has nothing to do with A, B, C, or D; maybe it generates "E", or maybe it generates "zombie" or something completely random. So depending on how you decide to evaluate your LLM, or how you decide what counts as the correct generation, you might get very different scores for your models. And actually there used to be, thankfully it's now fixed, three different implementations of MMLU: the HELM implementation, the original one, and the harness one, which is essentially the one the Hugging Face Open LLM Leaderboard used early on, and the MMLU scores for, say, LLaMA 65B were very different between them. All this to say that even in the simple case of closed-ended tasks, depending on how you prompt the model, how you decide what the correct answer is, and how you sample from the model, you get really different scores.
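These "what counts as the model's answer" conventions differ in code, not just in prompt wording. Here is a sketch of two of them using the Hugging Face `transformers` API, with `gpt2` as a stand-in model purely for illustration:

```python
# Sketch of two MMLU-style scoring conventions that can give different numbers for the same model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_by_likelihood(prompt: str, choices=("A", "B", "C", "D")) -> str:
    """Convention 1: pick the choice letter the model assigns the highest probability to."""
    scores = {}
    for c in choices:
        ids = tok(prompt + " " + c, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits
        # log-probability of the final token (the choice letter) given the preceding prompt
        scores[c] = torch.log_softmax(logits[0, -2], dim=-1)[ids[0, -1]].item()
    return max(scores, key=scores.get)

def answer_by_generation(prompt: str) -> str:
    """Convention 2: let the model generate freely and parse the first character.
    This can return something that is not A/B/C/D at all, which is exactly the problem."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=5, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:]).strip()[:1]
```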
There will inevitably be some differences between implementations in the community, and it can be very annoying, let's put it that way.

Another challenge, which I think someone asked about before, is contamination. Here are a few tweets that point to contamination. In one of them, someone notes that on Codeforces problems from before 2021 the models score 10 out of 10, while on recent problems they score 0 out of 10, which clearly points to the fact that the older problems were in the training set of these LLMs: the models know the pre-2021 answers perfectly. That's this type of contamination, where models are actually trained on some of the benchmark, which is definitely not something we want. And here's another one, where Susan prompted Phi-1.5, which is an LLM, to generate the first question of a benchmark, and it was simply able to
reproduce the entire question verbatim, which clearly shows the model was trained on it. I saw your question on multi-turn; I'm going to skip it for now because I want to get through the rest, but I'll come back to it during the question session.

So, detecting contamination. As I said, contamination is a big problem, and there are different ways to try to detect it. One way is to look at the likelihood the model assigns: if the likelihood is very high, so the model is really certain about its answer, that points to the fact that maybe the example was in the pre-training set. Another way, which I actually find very smart, is to do an exchangeability test. The idea is that most benchmarks online are always stored in the same order, so you compare the likelihood of the benchmark in its canonical order against shuffled orders. Since each example should have no relation to the previous or next example, the ordering shouldn't matter: example one could just as well come after example two. So if the model assigns a noticeably higher probability to the canonical order, it likely saw the dataset in exactly that order, i.e., the benchmark was in the pre-training set. Those are some ways of detecting contamination.
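A sketch of the exchangeability idea: compare the model's log-likelihood of the benchmark in its canonical order against shuffled orders; if the canonical order is consistently more likely, that is evidence of contamination. This is a simplified illustration of the idea, not any particular published test, and `sequence_logprob` is a hypothetical helper that scores a concatenated text with your model.

```python
# Sketch of an exchangeability-style contamination check.
# `sequence_logprob(text)` is a hypothetical helper that returns the model's total
# log-probability of `text` (e.g. via transformers, summing per-token log-probs).
import random

def contamination_score(examples: list[str], sequence_logprob, n_shuffles: int = 50) -> float:
    """Fraction of random orderings that the model finds LESS likely than the canonical order.
    For an uncontaminated model this should hover around 0.5; values near 1.0 suggest
    the model saw the benchmark in its published order during pre-training."""
    canonical = sequence_logprob("\n\n".join(examples))
    worse = 0
    for _ in range(n_shuffles):
        shuffled = examples[:]
        random.shuffle(shuffled)
        if sequence_logprob("\n\n".join(shuffled)) < canonical:
            worse += 1
    return worse / n_shuffles
```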
There are also some ways of alleviating this. One way is to have a private test set that you never put online. Another way is to have a dynamic test set; this is for example what is done in Chatbot Arena, where humans interact with the models every time and ask different questions, so the inputs are constantly changing and you can't really have contamination.

Another challenge is saturation. Basically, every time a benchmark gets used by the community, it ends up being completely saturated pretty quickly, I'd say somewhere between six months and two years. This is actually one of the reasons, and this is the plot from the Open LLM Leaderboard v1, why they had to move to v2: basically all the benchmarks they were using got saturated in less than a year. Here you see human baselines, and for most of them the best LLM reached the human baseline. We actually had the same thing with AlpacaEval, where after six months the models went from around a 10% win rate to around 90 or 95%, which is why we had to do AlpacaEval 2.0.

Another issue with benchmarking, and this is from a paper, I think at ACL, I can't remember exactly, that analyzes the papers submitted to ACL: you see that most of the
papers, out of 461, 69.4% only look at English. So honestly, most evaluations are essentially about English and about raw performance. There are some multilingual benchmarks, which I link to here, and I think more people should be looking at those.

Okay, I think this is probably enough for now, and maybe one last thing before I open it up for questions. One challenge, I would say the challenge of challenges, is that researchers are actually incentivized to keep using the same benchmark even if the benchmark is bad. The reason is that every time we write a paper, we have to compare to previous results on the previous benchmarks, so you're incentivized to always use the same benchmarks. In 2019, which is pretty recent, 82% of papers in machine translation were still using BLEU, the metric I briefly talked about before, even though at that point there were already many better metrics. So there's this issue that in academia we are kind of forced, or incentivized, to keep the same benchmarks even if they're not great.

Okay, so I'll skip through the rest and I'm happy to take some questions; I'm going to read through them. Thank you, do we get a round of applause for Yann? Okay, I see there's a bunch of questions already, so let me scroll up. The first one we already talked about. "Is there work to mitigate concerns of..." — and please also write your questions in the
chat, I'm going to read them right now. "Is there work to mitigate concerns of these benchmarks leaking into the LLM training corpus?" That's what I showed you before. "Can you dive deeper into how to evaluate multi-turn chats?" Multi-turn chats, oh yes, okay. Multi-turn chat is done in exactly the same way: if you use MT-Bench, it's the same as AlpacaEval, in the sense that you ask the first question, then you provide the model's answer, then you ask the second question, and then you simply ask the LLM judge how good the entire chat is, rather than how good the last answer is. So at least when you use LLMs or humans, there's nothing hard about multi-turn: you just ask the LLM or the human how good the entire chat is. It's still very subjective, but there's nothing more challenging here than in the single-turn case.
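Concretely, "judge the whole chat" just means the judge prompt contains the full conversation rather than a single question/answer pair. A minimal sketch, where `judge` is again any LLM call of your choice:

```python
# Sketch: multi-turn evaluation is just single-turn evaluation applied to the whole transcript.
def judge_conversation(turns, judge) -> int:
    """turns: list of (user_message, model_answer) pairs in order.
    Returns a 1-10 quality score for the entire conversation, as rated by the judge LLM."""
    transcript = "\n\n".join(
        f"User: {user}\nAssistant: {answer}" for user, answer in turns
    )
    prompt = (
        "Rate the assistant's overall performance across the whole conversation "
        "on a scale of 1 to 10. Reply with only the number.\n\n" + transcript
    )
    return int(judge(prompt).strip())
```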
Next question: "When we use an LLM to do evaluation, do we need to fine-tune the judge LLM?" Okay, that's a great question. Most of the time, people actually don't fine-tune: AlpacaEval is not fine-tuned, MT-Bench is not fine-tuned, they just use the best available models out of the box, so usually either Claude or GPT-4, and you prompt them, telling them what to look for. For general domains this is usually good enough; for expert domains I think it's actually not good enough, because, as the question suggests, the LLMs might not have the background knowledge. That's why I would do what I told you before with RubricEval, where you have experts write very detailed rubrics, or very detailed guidelines, for every instruction on how to evaluate it, because models given rubrics or detailed guidelines are usually very good at applying them.

Other question: "Can you please explain a bit more the difference between RubricEval and the existing LLM-as-a-judge setups,
since existing setups also use a scale with a language explanation for each score?" Yeah, so the big difference with RubricEval is that the guidelines are written for every single instruction. Right now, for most benchmarks, you have guidelines, for example for Likert scores, that are written for the entire benchmark, so it's like, "you should give a five if the output is good", but then what does good mean? In the case of RubricEval, it's really an instruction-specific guideline, where you say, "okay, good means that this term has to be in the answer, and this other term has to be in the answer." There are some benchmarks, for example WildBench, that already have some checklists, but they're really not very specific. I didn't show it in our results, but you don't actually see a big gain from those checklists because they're not specific enough. So what I'm proposing is very detailed rubrics, where you have experts spend the
time to write detailed guidelines. So Oscar asks: "When using GPT-4 as a judge, will GPT-4 favor its own output?" This is a very good question. It does, but you might be surprised that this bias, which we call self-bias, is not as high as you might think. For example, we checked this for AlpacaEval, where I had GPT-4 ranking five different models, then Claude ranking the same models, and a third judge, I forget which one, ranking the five models, and they all agreed, at that time at least, that GPT-4 was the best model. So even though there's a self-bias, the judges can still prefer other models. Another reason I don't think it's that much of an issue is that it only biases the result for a single model: right now AlpacaEval has 200 models or more on the benchmark, so it only means there's a bias for one of the 200, which I think is not that bad. Plus, as I said,
the bias is maybe not as strong as some might think.

Next question: "If the model is contaminated by the dataset, how do you tell what the model learns from fine-tuning versus what is already known from pre-training, or should you just try to avoid that benchmark entirely?" That's a good question. I've never seen work on that, and I don't think you can know, besides fine-tuning again on a different dataset. That being said, fine-tuning is usually not done on benchmarks. The reason contamination happens is not that companies want it; I don't think any company really wants contamination. What ends up happening is that they download as much data as possible, for example from GitHub, and it just happens that, say, the AlpacaEval instructions are on GitHub, so if they are not very careful about removing those instructions, or that dataset, from the pre-training corpus, they end up pre-training on the test set. But fine-tuning uses a much smaller subset: you fine-tune on at most maybe a million examples, and usually, at least in the open-source community, fewer than 100,000, so you know very well which dataset you're fine-tuning on. So I think you basically never fine-tune on your benchmark, unless you actively try to cheat at the benchmark.

Next question: "What are the potential future research directions for creating such benchmarks?" Great question. So,
as I said, I'm really excited about this RubricEval way of thinking, which I would broadly describe as human-in-the-loop. Right now I talked about LLMs doing the evaluation, but I really think we need humans in the loop even more, especially when you put models in production: how do you best make LLMs collaborate with humans to make these benchmarks more scalable but also higher quality? I think RubricEval is just one potential approach; there might be others. Another research direction I'm excited about for LLM-based benchmarks is that so far I've only talked about LLMs being used to define the metric, with the instructions kept fixed. But that's not how we do it as humans: when I interact with a model and want to know whether it's better than another one, I ask some questions and search for cases where the model is bad. I would love to see more work where LLMs try to generate the instructions that will find what the model is bad at and what it's good at. There's already some related work in adversarial red teaming, which is a little different because it only tries to find the cases where your model is terrible. What you should really be thinking about when you build a benchmark is searching so that you can estimate, with fewer cases,
the real score of your LLM. Another way I view it is that you're trying to estimate an expectation, and you want to use something like importance sampling to lower the variance of that estimate, and you do that with LLMs. I have a bunch of ideas for this, but I'll move on for now.
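On the importance-sampling remark: the idea would be to sample instructions from a proposal distribution that over-samples informative cases, then reweight so the estimate is still unbiased for the usual uniform-over-benchmark expectation. A toy sketch under these assumptions, where `score(model, x)`, `p`, and `q` are hypothetical helpers:

```python
# Toy sketch of an importance-sampled benchmark score.
# p(x): the distribution you actually care about (e.g. uniform over the benchmark).
# q(x): a proposal that over-samples instructions likely to be informative for this model.
# score(model, x): any per-instruction metric in [0, 1] (hypothetical helper).
import random

def importance_sampled_score(model, instructions, p, q, score, n: int = 100) -> float:
    """Unbiased estimate of E_{x~p}[score(model, x)] using samples drawn from q."""
    weights = [q(x) for x in instructions]
    total = 0.0
    for x in random.choices(instructions, weights=weights, k=n):
        total += (p(x) / q(x)) * score(model, x)   # reweight each sample by p/q
    return total / n
```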
Next question: "Thank you for the talk, I have a few questions. Could you share more about the design of the prompts: how to design the evaluation prompt for the benchmark and the target LLM to reflect the different dimensions or criteria we're trying to evaluate? Is there a mechanism to automate the prompt generation?" Okay, I'm not sure I completely understand, but if what you're asking about is the prompt for the LLM judge: it used to be a lot of work to prompt these models. For example, for the initial AlpacaEval we had a few-shot setup, meaning we had a few examples of how to evaluate models in the prompt. Honestly, with more recent models the prompt feels less important. It used to be really important to use, for example, a JSON output format. The ordering is still very important, though: when you show two outputs and ask the model which one is better, it's very important to randomize the order, because the model usually prefers the first answer. But generally, prompt engineering is less and less important; you just have to make sure you don't have these kinds of biases, for example the order bias, so try to randomize over those things. Now, your question about automating prompt generation: there's a lot of work on that, I think a few papers from Toronto, I believe from Roger Grosse's lab, I forget the exact titles, but there are many. As I said, though, I think prompt tuning is less and less important with these more powerful models.
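The order randomization mentioned above is cheap to do. A sketch of a pairwise judge call that shuffles which output is shown first and maps the verdict back (the `judge` callable is again any LLM API):

```python
# Sketch: randomize which output the judge sees first to wash out position bias.
import random

def pairwise_preference(instruction, output_model, output_baseline, judge) -> int:
    """Returns 1 if the judge prefers the model output, 0 if it prefers the baseline."""
    flipped = random.random() < 0.5
    first, second = (output_baseline, output_model) if flipped else (output_model, output_baseline)
    prompt = (
        f"Instruction: {instruction}\n\nOutput (A):\n{first}\n\nOutput (B):\n{second}\n\n"
        "Which output follows the instruction better? Answer with exactly 'A' or 'B'."
    )
    verdict = judge(prompt).strip().upper()[:1]
    prefers_first = (verdict == "A")
    # Map the verdict back to "model vs baseline" regardless of presentation order.
    return int(prefers_first != flipped)
```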
And the second part of the question: how to ensure low cost of evaluation, especially for evaluation based on long prompts. So, how to keep the evaluation cheap. Even the most powerful models are always going to be cheaper than humans; even the models from a year and a half ago, which were probably 10x more expensive than today's, were already 10x to 30x less expensive than humans. And if you have a long prompt, humans would also need to read a lot, so I'm pretty sure it's always going to be cheaper than humans, at least if you pay them as you should. But to keep the cost low, one thing you can do is try not to have too much chain of thought: input tokens are usually cheaper than output tokens, so try to avoid very long chains of thought; that's what we did to decrease the cost. Another thing you should probably do, if you don't need to evaluate your model right now, is use batch generation: I think OpenAI and also Anthropic have these batch APIs where requests are processed by the end of the day and the cost is divided by two or three. So I don't think evaluation is ever going to be more expensive than humans, and it's already pretty cheap.
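On the cost point, a back-of-the-envelope sketch; all prices and token counts below are made-up placeholders, so plug in the current numbers for whatever judge and batch API you actually use:

```python
# Back-of-the-envelope eval cost (all numbers are placeholders, not real prices).
def eval_cost_usd(n_examples: int,
                  input_tokens_per_example: int = 1500,
                  output_tokens_per_example: int = 300,   # long chain-of-thought inflates this
                  usd_per_1m_input: float = 2.50,
                  usd_per_1m_output: float = 10.00,
                  batch_discount: float = 0.5) -> float:  # e.g. a batch API at half price
    per_example = (input_tokens_per_example * usd_per_1m_input
                   + output_tokens_per_example * usd_per_1m_output) / 1_000_000
    return n_examples * per_example * batch_discount

print(f"${eval_cost_usd(800):.2f} for an AlpacaEval-sized set (~800 instructions)")
```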
Next question: "In the LLM-as-a-judge paper you observed that few-shot prompting the LLM helped improve consistency when dealing with position bias, but also stated that this method increased cost and possibly introduced new biases. Have you done or seen further work on the impact of few-shot prompting?" Just to be clear, I'm not the one who wrote the LLM-as-a-judge paper, and it's been a while since I read it. But personally, in AlpacaEval, I saw that adding few-shot examples introduced some biases, actually decreased the correlation with humans, and made the evaluation more expensive, so I ended up removing them, and that's why I now do everything zero-shot. I did not do what you're asking about, which is doing multiple runs of zero-shot prompts; I think there's a paper from Cohere that does that, and I would suggest you look at it, I think it's called LLM panel or something like that.

Great, next: "Because the order seems to be so important, doesn't that actually hint that these models don't reason about the input, or is this specific bias also present in humans?" Humans actually have a huge bias for order: every time you work with MTurkers, for example, you always randomize the order in which you show things. So yes, that bias is also present in humans; there is probably some work that has looked at
whether humans or LLMs have the stronger version of this bias; I don't actually know, but I'm sure there is work analyzing that. I do agree with you that it hints at some lack of reasoning, and I think that's partly because we don't define, per instruction, what the judge should be looking at; I'm hoping that things like RubricEval, which say very specifically "okay, you need this and this and this", will improve this type of issue. Also, RubricEval is absolute, which means you don't need two outputs to compare, so you don't have this ordering issue anymore. But I agree with you that it hints there might be some issue with these LLMs, which I think is more an issue of us not defining exactly what we want in our evaluation procedure.

Next, sorry, let me see where I was. "How do you ensure the consistency of evaluations, given that GPT-4 or other LLMs may be continuously trained and changed by the service
provider? How do you weigh evaluation results generated by different evaluators?" So, consistency of evaluators: this is hard. I would never compare results from different evaluators; the correlation between them is actually pretty high, but I would really say you should stick with the same evaluator. It's true that it's annoying, because some of the API models are continuously retrained; that's the benefit of using an open-source model for evaluation, and that's probably what I would do if I really cared about this: I would use something like a Llama 3 405B or 70B, and then you know it's always the same model. That said, in AlpacaEval we actually see very high agreement if you compare the rankings given by different LLM judges, as long as the LLM is strong enough; the absolute numbers will change, but the ranking usually doesn't change by that much. So if you really care about the absolute score, rather than just the ranking, then I really think you should not be using something like AlpacaEval, WildBench, or MT-Bench; you should be using something based on well-defined Likert-type scores, something like what RubricEval will hopefully end up being.

"How does saturation happen?" Good question. There are two things. One, the models are just getting better and better, so quickly that they
saturate these benchmarks; that's one possible cause. Another one, at least for the most used benchmarks like MMLU — oh, I think we have a hard stop now. Okay, feel free to ask me questions online; I think the video will be on YouTube, so you can also ask questions there and I'll try to keep answering. Thanks everyone for coming, and since there's a hard stop we should leave it there. — Yann, thank you so much for your time, we really appreciate it, and if anybody still on the Zoom has any further questions for Yann, please feel free to reach out to him directly. Thank you, Yann. — Great, thanks, bye, and thanks for the invite.