Join my AI Academy - https://www.skool.com/postagiprepardness
🐤 Follow Me on Twitter https://twitt...
Video Transcript:
So there actually was a new research paper that could put the AI industry in jeopardy, and I'm not saying that lightly. This one is a little bit concerning because it refers to the reliability of these models. I came across this tweet on Twitter which basically says that o1-preview shows a whopping 30% reduction in accuracy when Putnam math problems are slightly varied. Essentially what they're stating is that when they take a popular benchmark and make some slight alteration to it, there is a whopping 30% reduction in the accuracy of these models when you evaluate them. And this isn't really a good thing, because, as this guy Elvis points out, robustness is important, as it's key to a model's reliability, and a model being reliable is something we need for the AI industry. If a model isn't reliable and predictable, then we can't have wide-scale applications for it, because if you want to use it in finance or in business, you're going to have to make sure these models are really, really accurate at what they do. If they aren't, and there's so much variance, like a 30% reduction in accuracy from a test result, then that's not something you want to use in-house across your different projects.

So the paper starts with the abstract: as LLMs continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated. Benchmark saturation is actually something the AI industry kind of wanted, so this part is not surprising. But because the benchmarks are becoming saturated, this is where they say: we present the Putnam-AXIOM Original benchmark, consisting of 236 mathematical problems from the William Lowell Putnam Mathematical Competition, along with detailed step-by-step solutions, and to preserve this benchmark's validity and mitigate potential data contamination, we created the Putnam-AXIOM Variation benchmark with functional variations of 52 problems. They're basically saying: look, all we did was alter problem elements like variables and constants, and we can generate unlimited novel, equally challenging problems not found online. Because they can just change a few things about the problems, they can create tests that the model hasn't seen before. It then says that almost all models have significantly lower accuracy on the variations than on the original problems, and the results reveal that OpenAI's o1-preview, the best performing model, achieves a mere 41.95% accuracy on Putnam-AXIOM Original but experiences around a 30% reduction in accuracy on the variations dataset when compared to the corresponding original problems. They're basically stating: look, the model gets about 41% on a test that was previously potentially out there in the wild, and when we change it to a test that we know for a fact o1-preview hasn't seen before, there is a 30% reduction in accuracy.

Now, some of you might be saying: why is this o1-preview, why is this not o1, why is this not o3? Number one, o3, I'm pretty sure, doesn't have API access just yet because the model is so new, and for o1 pro mode, or o1 at the time, there was no API when they decided to conduct this study.

Now here are the two changes they made, like I've explained before. The variable change: the simplest variation is a variable change, where variable names are altered and the final answer is unvaried; variable changes slightly modify the problem from its original statement, which models could have trained on. The constant change: constant changes modify the numeric properties of the problem. So, long story short, they just altered some letters and some numbers in the text, and they actually show us what that looks like. You can see the variable change, where the symbols are renamed: the variables x, y and p are changed to w, v and l, so the letters change but nothing about the problem's structure does. Then the constant change is where the numeric values are modified: you can see right here that 2011 was changed to 4,680. And this is basically the kind of benchmark they built. It's not like they created an entirely new benchmark; everything was basically the same, but they just changed the letters and they just changed the numbers, with the letters being the variables and the numbers being the constants. So you can see why, when we think about it, the model should get this right, because only a very small thing is being changed.
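Just to make the mechanics concrete, here's a minimal sketch of what a "functional variation" could look like in code. This is my own toy illustration, not the paper's actual implementation: the problem template, symbols and constants are all made up, but the idea matches what's described above: rename the variable symbols, swap the constants, and recompute the ground-truth answer so the new problem stays equally well-defined.

```python
import random

# A toy "functional variation" in the spirit of Putnam-AXIOM (illustrative only,
# not the paper's code). The wording of the problem stays fixed while the
# variable names and constants are swapped out.

PROBLEM_TEMPLATE = (
    "Let {x} and {y} be real numbers with {x} + {y} = {n}. "
    "What is the maximum value of {x} * {y}?"
)

def make_variation(seed: int) -> dict:
    rng = random.Random(seed)

    # Variable change: rename the symbols (x -> w, y -> v, ...); the answer's form is unchanged.
    x, y = rng.sample(["x", "y", "w", "v", "a", "b"], 2)

    # Constant change: alter a numeric property of the problem (here the fixed sum n).
    n = rng.choice([10, 12, 20, 2024, 4680])

    problem = PROBLEM_TEMPLATE.format(x=x, y=y, n=n)

    # Because the variation is "functional", the ground-truth answer can be recomputed:
    # x * y is maximised when x = y = n / 2.
    answer = (n / 2) ** 2

    return {"problem": problem, "answer": answer}

if __name__ == "__main__":
    for seed in range(3):
        v = make_variation(seed)
        print(v["problem"], "->", v["answer"])
```

The point of the recomputable answer is that each generated variant is just as well-posed and just as hard as the original, but its exact wording and numbers never appeared online.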
And so this right here is the chart that has, I guess you could say, a lot of people in the AI industry freaking out, and I guess those critics who argue that these models can't reason now have this as their trump card. You can see it says that the drop in accuracies on this benchmark is statistically significant for nearly all models, and the figure shows the mean accuracies with 95% confidence intervals. What's pretty crazy is the size of the drop: on the original, o1 is at around 50%, and then it drops all the way down to nearly 30%, or 35% in some cases, which of course isn't great if you want a model to be really robust, in the sense that it sees the problem and reasons through it in the exact same way. This is something I talked about in another research paper video, which I'll be discussing later. When we take a look at exactly what's going on, the paper notes that the low accuracies underscore how difficult the benchmark is, which is fair, and it does show that o1-preview is a decent model. But what they state here is that for models like o1-preview, GPT-4o, Claude 3.5 Sonnet and NuminaMath 7B, non-overlapping confidence intervals reveal statistically significant differences, indicating artificially inflated performance on the original questions due to data contamination. Looking at the numbers highlights significant accuracy declines across all models: GPT-4o shows the steepest drop at 44%, followed by o1-preview at 30% and GPT-4 at 29%, and even the very famed Claude 3.5 Sonnet at 28.5%.
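For anyone wondering what "mean accuracies with 95% confidence intervals" and "non-overlapping confidence intervals" actually involve, here's a rough sketch using the standard normal approximation. The paper doesn't spell out its exact interval method in the parts quoted here, so treat this as a generic illustration; the original-set numbers come from the 41.95% on 236 problems mentioned above, while the variation-set numbers are placeholders, not the paper's data.

```python
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Mean accuracy with a normal-approximation 95% confidence interval.
    (The paper reports mean accuracies with 95% CIs; the exact interval
    method isn't stated in the quoted text, so this is just the standard
    normal approximation.)"""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

def intervals_overlap(ci_a, ci_b) -> bool:
    _, lo_a, hi_a = ci_a
    _, lo_b, hi_b = ci_b
    return not (hi_a < lo_b or hi_b < lo_a)

# o1-preview on Putnam-AXIOM Original: 41.95% of 236 problems (roughly 99 correct),
# as quoted above from the paper.
original = accuracy_ci(correct=99, total=236)

# Hypothetical variation-set result (placeholder numbers, NOT the paper's data).
variation = accuracy_ci(correct=15, total=52)

print("original  acc=%.3f CI=(%.3f, %.3f)" % original)
print("variation acc=%.3f CI=(%.3f, %.3f)" % variation)

# Non-overlapping intervals are the paper's informal criterion for a
# statistically significant drop between original and variation accuracy.
print("CIs overlap:", intervals_overlap(original, variation))
```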
So this is a significant drop on these questions when they just managed to change those two things. We can see that this is definitely a problem, because if we're trying to get these models to be robust, to be reliable, and to have wide-scale usage, this is an issue we have to solve. You can't have a model that does really well on a specific benchmark and then, when you start to use it in production, the accuracy drops to a point where it's no longer at the reliability threshold where it's viable for a company to use.

Now I'm going to talk about two things very quickly that could potentially explain this, and both are things they discuss in the paper. The first is overfitting: this is where you create an LLM that matches the training set so closely that the model fails to make new predictions on new data. Basically, the model is trained so much on test-style data that it performs well on tests, but when it comes to real-world scenarios it just doesn't perform well. This is an issue a lot of people say they have with smaller models: they say these models are just overfit on test-style questions like GSM8K, and because of that, the small models we think are getting so amazing are only performing well on test-style questions, which means they're just overfit for the problem, and that's why they can perform so well at it.

The other thing the paper talks about is data contamination. This is where you inadvertently mix samples from evaluation benchmarks into pre-training corpora. Basically, when you're training these models on large amounts of text from the internet, because you're gathering so much data, sometimes some of that test data just slips into the model's training, and of course if the model has seen it in training, then it's going to do well on that test. That's exactly why changing the problems and introducing those variations lets us see whether or not the models do well on new tests. Contamination happens quite a lot with LLMs, and it's why a lot of people can be skeptical of benchmarks. It's also why I personally advise people: even if there is data contamination out there, create your own benchmarks on tasks that you have on a day-to-day basis, and then every time a model comes out or gets updated, put those same prompts into the model and see if it performs better or worse, because everyone has their own specific scenarios and workflows. And while yes, looking at GSM8K and those benchmarks is very useful, a lot of the time we have our own personal use cases that data contamination might overshadow.
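On the "create your own benchmarks" advice, here's a minimal sketch of what that could look like. Everything in it is a placeholder (the example prompts, the `ask_model` stub and the log file name), since the whole point is that you plug in your own day-to-day tasks and whatever model API you actually use, then re-run the same fixed set every time a new or updated model ships.

```python
import json
from datetime import date

# Your own private benchmark: day-to-day tasks with answers you can check.
# (Placeholder examples; replace with prompts from your real workflows.)
MY_BENCHMARK = [
    {"prompt": "Extract the invoice total from: 'Total due: $1,284.50'", "expected": "1284.50"},
    {"prompt": "What is 17 * 24?", "expected": "408"},
]

def ask_model(model_name: str, prompt: str) -> str:
    """Call whatever model/provider you use. Deliberately stubbed out here;
    wire this up to your own API client."""
    raise NotImplementedError

def run_benchmark(model_name: str) -> None:
    results = []
    for case in MY_BENCHMARK:
        try:
            answer = ask_model(model_name, case["prompt"])
        except NotImplementedError:
            answer = ""
        results.append({
            "prompt": case["prompt"],
            "expected": case["expected"],
            "answer": answer,
            # Naive substring scoring; use stricter checks for your own tasks.
            "correct": case["expected"] in answer,
        })

    score = sum(r["correct"] for r in results) / len(results)
    record = {"date": str(date.today()), "model": model_name, "score": score, "results": results}

    # Append to a local log so you can compare models and releases over time.
    with open("my_benchmark_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

    print(f"{model_name}: {score:.0%} on {len(results)} private tasks")

if __name__ == "__main__":
    run_benchmark("new-model-to-test")
```

Because the prompts never leave your machine, this private set can't leak into anyone's training data, which is exactly the contamination problem the paper's variations are designed to work around.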
The paper also continues to talk about how, in addition to issues with rigor, GPT-4o displayed logical leaps and incoherent reasoning, shown in Figure 9, where the model simply assumes that an answer is correct; these logical leaps are symptomatic of an issue in GPT-4o's chain-of-thought reasoning, as the model prioritizes reaching the final answer rather than providing a rigorous logical output. Now, this isn't something I would say I'm worried about too much in GPT-4o's case, because in the paradigm we're currently in, GPT-4o is no longer considered a reasoning model, granted it can reason through basic instructions. When we look at the models we're truly using for raw reasoning capabilities, that is the o1 series; those are models with vastly more capability, and if those models are performing underwhelmingly, then that is something we should be concerned about, because if we develop a model that is a reasoning model and it has incoherent reasoning, then that is of course not good. But nonetheless, it shouldn't be the case that these models show a large discrepancy between the original tests and these new, unseen variation tests. So it's still definitely not great, because it means there are clearly some flaws around the models' training data, where they've been overfitted or there's been data contamination. And so this is where we get to OpenAI's o1-preview: they do talk about how, out of all models, o1-preview performs the best on Putnam-AXIOM Original, receiving 41.95%.