The model announced tonight by OpenAI, called o3, could well be the final refutation of the idea that artificial intelligence was hitting a wall. OpenAI, it seems, have not so much surmounted that wall as supplied evidence that the wall did not, in fact, exist. The real news tonight, for me, isn't that o3 just crushed benchmarks designed to stand for decades; it's that OpenAI have shown that anything you can benchmark, the o-series of models can eventually beat. Let me invite you to think of any challenge: if that challenge is ultimately susceptible to reasoning, and if the reasoning steps are represented anywhere in the training data, the o-series of models will eventually crush that challenge. Yes, it might have cost o3, or OpenAI, $350,000 in thinking time to beat some of these benchmarks, but cost alone will not hold the tide at bay for long. Yes, I'll give the caveats I always do, and there are quite a few, but I must admit, and I will admit, that this is a monumental day in AI, and pretty much everyone listening should adjust their timelines.

Before we get to the absolutely crazy benchmark scores: what actually is o3? What did they do? I've given more detail on the o-series of models in previous videos on this channel, but let me give you a 30-second summary. OpenAI get the base model to generate hundreds, or potentially thousands, of candidate solutions, following long chains of thought to get to an answer. A verifier model, likely based on the same base model, then reviews those answers and ranks them, looking for classic calculation mistakes or reasoning mistakes. That verifier model is, of course, trained on thousands of correct reasoning steps. But here's the kicker: in scientific domains like mathematics and coding, you can know what the correct answer is. So when the system generates a correct set of reasoning steps, steps that led to the correct, verified answer, the model as a whole can be fine-tuned on those correct steps. This fundamentally shifts us from predicting the next word to
predicting the series of tokens that will lead to an objectively correct answer. That fine-tuning on just the correct answers can be classed as reinforcement learning. So what, then, is o3? Well, more of the same. As one researcher at OpenAI told us tonight, o3 is powered by further scaling up reinforcement learning beyond o1. No special ingredient added to o1, it seems, no secret sauce, no wall. And that's why I said in the intro: if you can benchmark it, the o-series of models can eventually beat it.

What I don't want to imply, though, is that this leap forward with o3 was entirely predictable. Yes, I talked about AI being on an exponential in my first video of this year, and I even referenced verifiers and inference-time compute (that's the fancy term for thinking longer and generating more candidate solutions), but I am in pretty good company in not predicting this much of a leap this soon.

Let's briefly start with FrontierMath, and how did o3 do? This is considered, today, the toughest mathematical benchmark out there. As they put it in the announcement: this is a dataset that consists of novel, unpublished, and also very hard, extremely hard problems; even for Terence Tao, you know, it would take professional mathematicians hours or even days to solve one of these problems. And today, all offerings out there have less than 2% accuracy on this benchmark, and we're seeing, with o3 in aggressive test-time settings, we're able to get over 25%. They didn't say this in the announcement tonight, but the darker part of the bar, the smaller part, is the model getting it right with only one attempt. The lighter part of the bar is when the model gave lots of different solutions, but the one that came up most often, the consensus answer, was the correct answer. We'll get to time and cost in a moment, but those details aside, the achievement of 25% is monumental.

Here's what Terence Tao said at the beginning of November: these questions are extremely challenging. (He's arguably the smartest guy in the world, by the way.) I think
that in the near term, basically, the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert, like a grad student in a related field, paired with some combination of a modern AI and lots of other algebra packages. Given that o3 doesn't rely on algebra packages, he's basically saying that o3 must be a real domain expert in mathematics. Summing up, Terence Tao said that this benchmark would resist AIs for several years at least.

Sam Altman seemed to imply that they were releasing the full o3 perhaps in February, or at least in the first quarter of next year, and that implies, to me at least, that they didn't just bust every single GPU on the planet to get this score, with no realistic way of ever serving it to the public. Or, to phrase things another way, we are not at the limits of even the compute we have available today. The next generation, o4, could be with us by Q2 of next year, and o5 by Q3.

Here's what another top OpenAI researcher said: o3 is very performant. More importantly, progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of reinforcement learning on chains of thought to scale inference compute, way faster than the pre-training paradigm of a new model every one to two years. We may never get GPT-5, but we may get AGI anyway. Of course, safety testing may well end up delaying the public release of these new generations of models, and so there might end up being an increasingly wide gap between what the frontier labs have available to use themselves and what the public has.

What about Google-proof graduate-level science questions, GPQA? As one OpenAI researcher put it, take a moment of silence for that benchmark: it was born in November of 2023 and died just a year later. Why RIP GPQA? Well, o3 gets 87.
7%. Benchmarks are being crushed almost as quickly as they can be created.

Then there's competitive coding, where o3 establishes itself as the 175th highest-scoring global competitor, better at this coding competition than 99.95% of humans. Now, you might say that's competition coding, that's not real software engineering, but then we have SWE-bench Verified. That benchmark tests real issues faced by real software engineers; the "Verified" part refers to the fact that the benchmark was combed for only genuine questions with real, clear answers. Claude 3.
5 Sonnet gets 49%; o3 gets 71.7%. As foreseen, you could argue, by the CEO of Anthropic, the creators of Claude: the latest model we released, Sonnet 3.5, the new updated version, gets something like 50% on SWE-bench, and SWE-bench is an example of a bunch of professional, real-world software engineering tasks. At the beginning of the year, I think the state of the art was 3 or 4%. So in 10 months we've gone from 3% to 50% on this task, and I think in another year we'll probably be at 90%; I mean, I don't know, it might even be less than that.

Before you ask, by the way: yes, these were unseen programming competitions; this isn't data contamination. Again, if you can benchmark it, the o-series of models will eventually, or imminently, beat it. Interestingly, if you were following the channel closely, you might have guessed that this was coming in Codeforces. As of this time last year, Google had produced AlphaCode 2, which, in certain parts of the Codeforces competition, outperformed 99.
5% of competition participants. And they went on, prophetically: we find that performance increases roughly log-linearly with more samples.

Yes, of course I'm going to get to ARC-AGI, but I just want to throw in my first quick caveat. What happens if you can't benchmark it, or at least it's harder to benchmark, or the field isn't as susceptible to reasoning steps? How about personal writing, for example? Well, as OpenAI admitted back in September, the o-series of models, starting with o1-preview, is not preferred on some natural language tasks, suggesting that it is not well suited for all use cases. Again, then, think of a task: is there an objectively correct answer to that task? The o-series will likely soon beat it, as o3 proved tonight, regardless of how difficult that task is. If, however, the correctness of the answer, or the quality of the output, is more a matter of taste, well, that might take longer to beat.

What about core reasoning, though, and out-of-distribution generalization, what I started this channel to cover back at the beginning of last year? Forgetting about cost or latency for a moment, what we all want to know is how intrinsically intelligent these models are; that will dictate everything else. And I will raise that question through three examples to end the video.

The first is compositionality, which came in a famous paper published in Nature last year. Essentially, you test models by making up a language full of concepts, like "between" or "double" or colours, and see if they can compose those concepts into a correct answer. The concepts are abstract enough that they would, of course, never have been seen in the training data. The original GPT-4 flopped hard at this challenge in the Nature paper, and o1 pro mode gets close but still can't do it. After thinking for nine minutes, it successfully translates "who" as double, but doesn't quite understand "moro": it thinks it's something about symmetry, but doesn't grasp that it means between. Will o3 master compositionality? I can't answer that question, because I
can't yet test it.

Next is, of course, my own benchmark, called SimpleBench. This video was originally meant to be a summary of the 12 days: I was going to show off Veo 2 and talk about Gemini 2.0 Flash Thinking Experimental from Google. The thinking, this time in visible chains of thought, is reminiscent of the o-series of models. On the three runs we've done so far, it scores around 25%, which is great for such a small model as Flash, but isn't quite as good as even their own model, Gemini-Exp-1206. For this particular day of shipmas, though, we are putting Google to one side, because OpenAI have produced o3.

So here's what I'm looking out for in o3, to see whether it would crush SimpleBench: essentially, it needs to master spatial reasoning. Now, you can pause and read the question yourself, but I helpfully supplied o1 pro mode with this visual as well. Without even reading the question, what would you say would happen to this glove if it fell off of the bike? And let's say I also supplied you with the speed of the river. Well, you might well say to me: thanks for all of those details, but honestly, the glove is just going to fall onto the road. o1 doesn't even consider that possibility, and never does, because spatial data isn't really in its training data, nor is sophisticated social reasoning data. Wait, let me caveat that: of course, we don't know what is in the training data; I just suspect it's not in the training data of o1, and likely not in o3, but we don't know. Nor do we know the base model for o3: is it Orion, or what would have been GPT-4.
5 or GPT-5? OpenAI never mentioned a shift in what the base model was, but they haven't denied it either. Someone could make the argument that o3 is so good at something like physics that it can intuit for itself what would happen in spatial reasoning scenarios. Maybe, but we'd have to test it. What I do have to remind myself, though, with SimpleBench and spatial reasoning more generally, is that it doesn't strike me as a fundamental limitation for the models going forward. As I said right at the start of the intro to this video, OpenAI have, with o3, fundamentally demonstrated the extent of a generalizable approach to solving things. In other words, with enough spatial reasoning data, good spatial reasoning benchmarks, and some more of that scaled-up reinforcement learning, I think models would get great at this too. And frankly, even if benchmarks like SimpleBench can last a little bit longer, because of a paucity of spatial reasoning data, or text-based spatial reasoning data not being enough, you have simulators like Genesis that can model physics and give models like o3 almost infinite training data of lifelike simulations. You could almost imagine o3 or o4 being unsure of an answer, spinning up a simulation, spotting what would happen, and then outputting the answer.

And now, at last, what about ARC-AGI? I made an entire video not that long ago about how this particular challenge, created by François Chollet, was a necessary but not sufficient condition for AGI. The reason why o3 beating this benchmark is so significant is that each example is supposed to be a novel test, a challenge, in other words, that's deliberately designed not to be in any training data, past or present. Beating it, therefore, has to involve at least a certain level of reasoning. In case you're wondering, by the way, I think reasoning is actually a spectrum. I define it as deriving efficient functions, and composite functions. LLMs, therefore, have always done a form of reasoning; it's just that the functions that they derive
are not particularly efficient, more like convoluted interpolations. Humans tend to spot things quicker and have more meta rules of thumb, and with those meta rules of thumb we can generalize better and solve challenges we haven't seen before more efficiently. Hence why many humans can see what has occurred to get from input one to output one, and from input two to output two; GPT-4 couldn't, even o1 couldn't really, and for these specific examples even o3 can't. Yes, it might surprise you: there are still questions, ones that aren't crazy hard, that o3 can't get right. Nevertheless, o3, when given maximal compute (what I've calculated as being around $350,000 worth), gets 88%.

And here's what the author of that benchmark said: this isn't just brute force. Yes, it's very expensive, but these capabilities are new territory, and they demand serious scientific attention. We believe, he said, it represents a significant breakthrough in getting AI to adapt to novel tasks. Reinforced again and again with those chains of thought, or reasoning steps, that led it to correct answers, o3 has gotten pretty good at deriving efficient functions. In other words, it reasons pretty well. Now, Chollet has often mentioned in the past that many of his smart friends scored around 98% on ARC-AGI, but a fairly recent paper from September showed that, when an exhaustive study was done, average human performance was 64.
2% on the public evaluation set. Chollet himself predicted, two and a half years ago, that there wouldn't be, quote, a pure Transformer-based model that gets greater than 50% on previously unseen ARC tasks, within a time limit of five years.

Again, I want to give you a couple of quick caveats before we get to his assessment of whether o3 is AGI. One OpenAI researcher admitted that it took 16 hours for o3 to get to 87.5%, an increase rate of around 3.5% solved an hour. And another caveat, this time from Chollet's public statement on o3: OpenAI apparently requested that they didn't publish the high compute costs involved in getting that high score, but they kind of did anyway, saying the amount of compute was roughly 172x the low-compute configuration. If the low-compute, high-efficiency retail cost was $2,000, by my calculation that's around $350,000 to get the 87.
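That back-of-the-envelope calculation is easy to check. A minimal sketch, using only the two figures quoted above (the roughly $2,000 retail cost of the low-compute configuration and the "roughly 172x" compute multiplier; the exact underlying numbers are OpenAI's and ARC's, not mine):

```python
# Back-of-the-envelope check of the ARC-AGI high-compute cost estimate.
# Inputs are the figures quoted in the video: ~$2,000 retail cost for the
# low-compute configuration, and roughly 172x more compute for the
# high-compute run.
low_compute_cost_usd = 2_000
compute_multiplier = 172

high_compute_cost_usd = low_compute_cost_usd * compute_multiplier
print(f"${high_compute_cost_usd:,}")  # prints $344,000, i.e. "around $350 grand"
```

So "around $350,000" is just 172 times the quoted low-compute cost, rounded up.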
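Finally, to make the sampling-and-consensus mechanism described at the start of the video concrete, here is a minimal sketch of majority voting over candidate answers. The `candidate_answers` list and the `consensus_answer` helper are illustrative stand-ins: a real system would sample fresh chains of thought from the model and also score each chain with a trained verifier, neither of which is public.

```python
from collections import Counter

# Pretend we asked the model for eight independent chains of thought on the
# same question, each ending in a final answer. (Stand-in data; in the real
# system these would be fresh samples, and a verifier would also rank them.)
candidate_answers = ["42", "41", "42", "43", "42", "56", "42", "41"]

def consensus_answer(answers: list[str]) -> str:
    """Majority voting: return the final answer that appears most often.

    This is the 'consensus' score, the lighter part of the bar in the
    FrontierMath chart, as opposed to the darker single-attempt bar."""
    return Counter(answers).most_common(1)[0][0]

print(consensus_answer(candidate_answers))  # prints 42
```

The point of the mechanism: even if any single sampled chain is often wrong, the answer that independent chains converge on is much more likely to be correct, which is why the consensus bar sits above the single-attempt bar.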