OpenAI Just Revealed They ACHIEVED AGI (OpenAI o3 Explained)

TheAIGRID
Video Transcript:
So today marks a genuinely historic moment for the AI community, because it may well be remembered as the day AGI arguably happened. If you don't know why, it's because OpenAI today announced their new o3 model, the second iteration of their o1 series, the models that think for a long time before answering. And if you don't understand why this is potentially AGI, it's because the new system managed to surpass human performance on the ARC benchmark.

The reason the ARC benchmark is so important is that it is resistant to memorization. ARC is intended as a kind of IQ test for machine intelligence, and what makes it different from most benchmarks out there is precisely that design goal. If you look at the way LLMs work, they're basically a big interpolative memory, and the way you scale up their capabilities is by cramming as much knowledge and as many patterns as possible into them. By contrast, ARC does not require a lot of knowledge at all. It's designed to require only what's known as core knowledge: basic notions like elementary physics, objectness, and counting, the sort of knowledge any four- or five-year-old possesses. What's interesting is that each puzzle in ARC is novel, something you've probably not encountered before even if you've memorized the entire internet.

If you want to see what ARC actually looks like, this test that humans pass so easily but AI systems currently don't, you can look at the examples shown in the video. Every ARC task gives you input examples and output examples, and the goal is to understand the rule of the transformation and apply it to produce the output. In the demo, Sam is asked what's happening in one of the puzzles and answers, "probably putting a dark blue square in the empty space," which is exactly right. It's easy for humans to intuitively guess what the rule is.
It's actually surprisingly hard for AI to understand what's going on, and until now AI has not been able to crack this benchmark, even though a verified panel of humans can do it. The unique part about ARC-AGI is that every task requires distinct skills. What I mean by that is there won't be another task where you need to fill in the corners with blue squares, and that's on purpose: the point is to test the model's ability to learn new skills on the fly, not just to repeat what it has already memorized.

ARC-AGI version one took five years to go from 0% to 5% with leading frontier models. Today, however, o3 set a new verified state of the art: on low compute, o3 scored 75.7% on the ARC-AGI semi-private holdout set. This is extremely impressive because it is within the compute requirements for the public leaderboard, making it the new number-one entry on ARC-AGI-Pub. Those of you outside the AI community might not think this is a big deal, but it really is: this is a benchmark people have been trying to solve for around five years, one that many regarded as a gold standard for AI, and it marks the first time a system has outperformed humans at a task that AI systems traditionally fail at.

What was interesting is that there were two versions of the result. There was o3 with low tuning, meaning low reasoning effort: the model operates with minimal computational effort, optimized for speed and cost efficiency, suitable for simpler tasks where deep reasoning is not required, like basic coding and straightforward questions. And there was the high-tuned version, where the model takes more time and resources to analyze and solve problems, optimized for complex tasks that require deeper reasoning and multi-step problem solving. What we can see is that when the model is tuned to think for longer, it actually surpasses where humans currently are on the benchmark.

What's crazy about this is that the creators of the ARC-AGI benchmark said the performance on ARC-AGI highlights a genuine breakthrough in novelty adaptation: this is not incremental progress, we are in new territory. So they ask: is this AGI? o3 still fails on some very easy tasks, indicating fundamental differences from human intelligence. François Chollet, the researcher you saw at the beginning, said he doesn't believe this is exactly AGI, but that it does represent a major milestone on the way towards AGI. He says there is still a fair number of easy ARC-AGI-1 tasks that o3 can't solve, and there are early indications that ARC-AGI-2 will remain extremely challenging for o3. In his view, this shows it is feasible to create unsaturated, interesting benchmarks that are easy for humans yet impossible for AI, without involving specialist knowledge, and he now states that we will have AGI when creating such evals becomes outright impossible. Some people could argue he has moved the goalposts, because this contrasts a little with what he said six months ago about the benchmark surpassing 80%, which is what happened today. Back then he said: turn the question around; suppose that in a year a multimodal model can solve ARC, say at 80%, whatever the average human would get, then AGI, quite possibly yes. Though what he said he would honestly like to see is an LLM-type model solving ARC at around 80% after having been trained only on core-knowledge-related material.

Now, one real limitation of the model is compute cost. Chollet says this does not mean the ARC Prize competition is beaten: the ARC Prize targets the fully private dataset, which is a different and somewhat harder evaluation, and crucially, ARC Prize solutions must run within a fixed compute budget of about 10 cents per task. That matters because, if you look at the bottom of the results chart, the amount of compute behind high-tuned o3 puts its cost on the order of thousands of dollars per task, which is very expensive if you're trying to use the AI for anything at all. Models that search over many different candidate solutions are going to be really, really expensive, and we can see that reflected here, with this run coming in at well over $1,000 per task, which is a lot when you think about using it to perform any kind of everyday work. Of course, as we've seen with AI, costs will eventually come down. It's similar to how technology was bulky in the early days, the big TVs, the really bulky phones, and then found ways to become more efficient, faster, and cheaper; it's quite likely the same will happen here. And what I find crazy is that two years ago Chollet suggested that fully solving ARC-AGI was not going to happen within the next eight years, with 70% hopefully reachable in less than that, perhaps four or five years. AI is clearly speeding past most people's predictions.

We also learned that o3 did very well on SWE-bench, which is a very hard software-engineering benchmark. If you are a software engineer, this is probably not the best news for you, although with so many more people now coding with AI, there will likely be even more demand for engineers who actually understand the code that is being written.

One quick tangent: while reading about this breakthrough, one thing I had to remember is that o3 is actually the second iteration of the reasoning series, not the third. Some people might think o3 is OpenAI's third iteration, but "o2" was simply skipped because of a naming conflict with O2, the British mobile carrier. The fact that this is only the second iteration suggests that the next step, o3's successor or an eventual o4, could be quite a large jump, and might reach benchmark saturation.

Looking at the math benchmarks and the PhD-level science benchmarks, there is a decent improvement there as well, though I think these are reaching the benchmark-saturation zone: competition math is around 96.7%, and the science benchmark is at 87.7%. As I said before, a lot of people are starting to say that AI has slowed down because it's no longer improving at the same rate as before. What we have to understand is that as these benchmarks reach 95% or so, the incremental gains get harder and harder: first, there is only 10% left to gain, and second, it's quite likely that 3 to 5% of the questions are contaminated, in the sense that the questions themselves contain errors, which means 100% is simply not possible on certain benchmarks. That is why they decided to create the FrontierMath benchmark.

When FrontierMath was released, around two or three months ago, the best models could solve only about 2% of its questions; in the lead were Gemini 1.5 Pro and Claude 3.5 Sonnet, with o1-preview and o1-mini also on the chart. For those of you who haven't realized just how good o3 is: this is a model that scores 25%. That is really incredible when it comes to research-level math, because this kind of math is extremely difficult and all of the questions are completely novel. This is why the earlier benchmarks aren't truly showing the capabilities. Think about it like this: on the near-saturated benchmarks you might say the model has gotten maybe 10% better overall, which undersells it, because if you look at the really hard benchmark, where it's solving unseen math problems, there is a more than tenfold improvement over the previous state of the art, which is just absolutely incredible. I think that is the most important chart for people to internalize: you can't judge this model by benchmarks it nearly saturates; you have to look at where it posts these unprecedented scores. And Noam Brown, who works on reasoning at OpenAI, said that o3 is going to continue on that trajectory, so for those of you thinking that maybe AI is slowing down, it is clear that it is not.

Interestingly, we also got Sam Altman in an interview talking about what he believes AGI to be, and as I said before, the definition is constantly shifting. It used to be a term people used a lot for a really smart AI that was very far off in the future; as we get closer, he thinks it has become a less useful term, because people use it to mean very different things. Some people use it to mean something not that different from o1, and some use it to mean true superintelligence, something smarter than all of humanity put together. OpenAI now tries to use a five-level framework instead, and Altman says they are on level two, reasoning, right now, rather than the binary of "is it AGI or is it not," which he thinks has become too coarse as we get closer. He also said that by the end of next year, the end of 2025, he expects we will have systems that can do truly astonishing cognitive tasks, where you'll use one and think, "that thing is smarter than me at a lot of hard problems."

The only other thing to note is that if you are a safety researcher, Altman says please consider applying to help test o3-mini and o3, because OpenAI is excited to get these out to general availability soon, and he is extremely proud of the work the team has been doing to create these models. It will be interesting to see, when these models are actually released, what people are going to do with them, what they're going to build, and of course what happens next. If you enjoyed this video, let me know your thoughts on whether or not this is AGI. I do think the benchmark has been broken, and we're constantly finding new ones, so it will be interesting to see where things head next.
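The benchmark-saturation argument made earlier can be put as a quick back-of-the-envelope calculation. The 3-5% rate of flawed questions and the 96.7% score are just the figures quoted in the video, not measured values; this sketch only shows why a mid-90s score may already be near the effective ceiling.

```python
# Back-of-the-envelope: if some fraction of a benchmark's questions are
# themselves flawed, the effective ceiling sits below 100%, so a score
# in the mid-90s may already be close to saturated.

def effective_ceiling(flawed_fraction):
    """Best score achievable if flawed questions can never be answered 'correctly'."""
    return 1.0 - flawed_fraction

def headroom(score, flawed_fraction):
    """Remaining achievable gain, as a fraction of the whole benchmark."""
    return max(0.0, effective_ceiling(flawed_fraction) - score)

score = 0.967  # e.g. the ~96.7% competition-math figure from the video
for flawed in (0.03, 0.05):
    print(f"{flawed:.0%} flawed -> ceiling {effective_ceiling(flawed):.1%}, "
          f"headroom {headroom(score, flawed):.1%}")
```

With the video's numbers, a 96.7% score leaves at most about 0.3% of achievable headroom, and possibly none at all, which is the motivation it gives for moving to a fresh, unsaturated benchmark like FrontierMath.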