New AI Research Proves o1 CANNOT Reason!

3.15k views · 2907 words
TheAIGRID
Join my AI Academy - https://www.skool.com/postagiprepardness 🐤 Follow Me on Twitter https://twitt...
Video Transcript:
So there actually was a new research paper that could put the AI industry in jeopardy, and I'm not saying that lightly. This one is a little bit concerning because it refers to the reliability of these models. I came across a tweet stating that o1-preview shows a whopping 30% reduction in accuracy when Putnam math problems are slightly varied. In other words, when they take a popular benchmark and make some slight alteration to it, there is a 30% drop in the measured accuracy of these models. That isn't a good thing because, as Elvis (the tweet's author) points out, robustness is key to a model's reliability, and reliability is something the AI industry needs: if a model isn't reliable and predictable, we can't build wide-scale applications on it. If you want to use it in finance or in business, you have to be sure these models are really accurate at what they do, and if a test result can swing by 30%, that's not something you want to use in-house across your different projects.

The paper's abstract starts like this: as LLMs continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated. Benchmark saturation is something the AI industry actually wanted, so that part isn't surprising. The authors therefore present the Putnam-AXIOM Original benchmark, consisting of 236 mathematical problems from the William Lowell Putnam Mathematical Competition along with detailed step-by-step solutions, and, to preserve the benchmark's validity and mitigate potential data contamination, the Putnam-AXIOM Variation benchmark with functional variations of 52 problems. The idea is simple: by altering problem elements like variables and constants, they can generate unlimited novel, equally challenging problems not found online, which means they can create tests the models have never seen before. Almost all models show significantly lower accuracy on the variations than on the original problems, and the results reveal that OpenAI's o1-preview, the best-performing model, achieves a mere 41.95% accuracy on the Putnam-AXIOM Original but experiences around a 30% reduction in accuracy on the variation dataset compared to the corresponding original problems.
Basically, they're saying: the model got roughly 42% on a test that was potentially out there in the wild, and when we switch to a test we know for a fact o1-preview hasn't seen before, there is a 30% reduction in accuracy. Now, some of you might be asking: why o1-preview, why not o1 or o3? Number one, with o3 I'm pretty sure there's no API access just yet because the model is so new, and for o1 pro mode, or o1 itself, there was no API at the time they decided to conduct this study.

Here are the two changes they made. The variable change is the simplest variation: variable names are altered and the final answer is unvaried, so the problem is only slightly modified from its original statement, which models could have trained on. The constant change modifies the numeric properties of the problem. The paper shows what that looks like: in the variable change, the letters are renamed, so X becomes W, Y becomes V, and P becomes L; in the constant change, the numbers are altered, so 2011 becomes 4680. To be clear, the letters are the variables and the numbers are the constants. So it's not like they created an entirely new benchmark; everything is basically the same, they just changed the letters and the numbers. You can see why, when we think about it, the model should get this right, because only a very small thing is being changed.
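To make the mechanics concrete, here is a minimal sketch of how such functional variations could be generated programmatically. This is my own illustration, not the paper's actual implementation: the template, the renaming map, and the constant range are all hypothetical, and a real generator would operate on a parsed representation of the problem rather than naive string replacement.

```python
import random

def variable_change(problem: str, renames: dict[str, str]) -> str:
    """Rename the variables; the final answer is unchanged."""
    for old, new in renames.items():
        problem = problem.replace(old, new)
    return problem

def constant_change(template: str) -> str:
    """Swap the numeric constant for a fresh one, producing a novel but
    equally structured problem (the answer changes along with it)."""
    # The range here is an arbitrary illustrative choice.
    return template.format(year=random.randint(1000, 9999))

original = "Let P(x) be a polynomial with P(2011) = 4."
print(variable_change(original, {"P": "L", "x": "w"}))
# -> "Let L(w) be a polynomial with L(2011) = 4."
print(constant_change("Let P(x) be a polynomial with P({year}) = 4."))
# -> e.g. "Let P(x) be a polynomial with P(4680) = 4."
```

The point is just that a single template can emit unlimited fresh instances, so a model cannot have memorized any particular one of them.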
Then we get to the chart that has, I guess you could say, a lot of people in the AI industry freaking out, and that the critics who say these models can't reason are now treating as their trump card. The paper says the drop in accuracy on this benchmark is statistically significant for nearly all models, and the figure shows the mean accuracies with 95% confidence intervals. What's pretty crazy is the size of the drop: o1-preview sits at around 50% on the original and falls to roughly 30-35% on the variations, which isn't great if you want a model to be really robust, in the sense that it sees a problem and reasons through it the same way regardless of surface details. This is something I talked about in another research-paper video, which I'll discuss later. The paper does note that the low accuracies underscore how difficult the benchmark is, which is fair, and o1-preview is still a decent model. But here's what they state: for models like o1-preview, GPT-4o, Claude 3.5 Sonnet, and NuminaMath-7B, non-overlapping confidence intervals reveal statistically significant differences, indicating artificially inflated performance on the original questions due to data contamination. Looking at the numbers highlights significant accuracy declines across all models: GPT-4o shows the steepest drop at 44%, followed by o1-preview at 30%, GPT-4 at 29%, and even the famed Claude 3.5 Sonnet at 28.5%.
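For anyone who wants to see what "non-overlapping confidence intervals" means in practice, here is a small sketch that computes normal-approximation 95% intervals for two accuracies and checks whether they overlap. The counts are invented to roughly mirror the reported percentages; they are not the paper's actual evaluation data.

```python
import math

def wald_ci_95(correct: int, total: int) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for an accuracy
    (a binomial proportion)."""
    p = correct / total
    half = 1.96 * math.sqrt(p * (1 - p) / total)
    return p - half, p + half

# Hypothetical counts mirroring the reported numbers: 99/236 is about
# 41.95% on the originals; 75/260 is about 28.8% on the variations
# (assuming, say, 5 sampled instantiations of each of the 52 variation
# problems; that sample size is my assumption, not the paper's).
orig = wald_ci_95(99, 236)
var = wald_ci_95(75, 260)

print(f"original : [{orig[0]:.3f}, {orig[1]:.3f}]")
print(f"variation: [{var[0]:.3f}, {var[1]:.3f}]")
print("overlap:", orig[0] <= var[1] and var[0] <= orig[1])
```

When the two intervals don't touch, the gap between the accuracies is very unlikely to be sampling noise, which is the paper's argument that the original-benchmark scores are inflated.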
So that's a significant drop from changing just those two things. This is definitely a problem, because if we're trying to get these models to be robust, reliable, and widely usable, it's an issue we have to solve: you can't have a model that does really well on a specific benchmark and then, when you start using it in production, drops below the reliability threshold where it's viable for a company to use.

Two things could potentially explain this, and the paper talks about both. The first is overfitting. This is where an LLM matches its training set so closely that it fails to make good predictions on new data; the model has been trained so much on test-style data that it performs well on tests but not in real-world scenarios. A lot of people raise this complaint about smaller models: they're overfit on test-style questions like GSM8K, so the small models we think are getting so amazing are only performing well on test-style questions. The second is data contamination, where samples from evaluation benchmarks are inadvertently mixed into pre-training corpora. When you train these models on huge amounts of text from the internet, some of the test data can simply slip into the model's training, and if it has seen a test during training, of course it will do well on that test. That's exactly why changing the problems and creating those variations shows whether models do well on genuinely new tests. This happens quite a lot with LLMs, and it's why a lot of people are skeptical of benchmarks. It's also why I personally advise people: create your own benchmarks from tasks you do on a day-to-day basis, and every time a model comes out or gets updated, put the same prompts into the model and see whether it performs better or worse. Everyone has their own specific scenarios and workflows, and while looking at GSM8K and similar benchmarks is useful, our own personal use cases are often the thing that data contamination can overshadow. A minimal sketch of that workflow is below.
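Here is one way that personal-benchmark loop could look, assuming the openai Python package's chat-completions interface. The model name, prompts, and crude substring grading are placeholders you would replace with your own day-to-day tasks.

```python
from openai import OpenAI

# Your own day-to-day tasks, kept private so they can never leak into
# a training corpus. These two cases are placeholder examples.
PRIVATE_BENCH = [
    {"prompt": "What is 17 * 24?", "expected": "408"},
    {"prompt": "Reverse the string 'benchmark'.", "expected": "kramhcneb"},
]

def run_bench(model: str) -> float:
    """Re-run the same fixed prompts against a model and score them."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    correct = 0
    for case in PRIVATE_BENCH:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        answer = resp.choices[0].message.content or ""
        correct += case["expected"] in answer  # crude substring grading
    return correct / len(PRIVATE_BENCH)

if __name__ == "__main__":
    # Run this whenever a model is released or updated and compare scores.
    print(f"accuracy: {run_bench('gpt-4o'):.0%}")
```

Because the prompts stay on your machine and never get published, contamination can't inflate the scores, which is the whole point of the advice.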
The paper also continues: in addition to issues with rigor, GPT-4o displayed logical leaps and incoherent reasoning, shown in figure 9, where the model simply assumes that an answer is correct. These logical leaps are symptomatic of an issue in GPT-4o's chain-of-thought reasoning, as the model prioritizes reaching the final answer rather than providing a rigorous logical output. Now, this isn't something I'd say I'm worried about too much in GPT-4o's case, because in the paradigm we're currently in, GPT-4o is no longer considered a reasoning model, even though it can reason through basic instructions. When we look at which models we truly use for raw reasoning capability, that's the o1 series, models with vastly more capability, and if those models perform underwhelmingly, that is something we should be concerned about: if we develop a model that is a reasoning model and it reasons incoherently, that's not good. But regardless, there shouldn't be a large discrepancy between the known tests and these new, unseen variants, and the gap means there are clearly some flaws around the models' training data, whether overfitting or data contamination. And this is where we get to OpenAI's o1-preview: they note that out of all models, o1-preview performs the best on the Putnam-AXIOM Original, receiving 41.95% accuracy.