OpenAI just released o3. This is AGI, and before you flame me in the comments, let me explain why I'm saying this, why OpenAI is basically saying this, and how the ARC Prize is basically showing it. When you see the results you will be stunned; it is far better than anything else out there right now.

So for the last day of this event, we thought it would be fun to go from one frontier model to our next frontier model. Today we're going to talk about that frontier model, which you would think logically maybe should be called o2, but out of respect to our friends at Telefónica, and in the grand tradition of OpenAI being really, truly bad at names, it's going to be called o3.

All right, so there it is: their next frontier model. This is beyond o1, and yes, they had to skip o2 because there is a telecom company named O2 and they obviously didn't want trademark issues. This is a brand-new, next-generation frontier model beyond o1, kind of insane to think about, because it feels like we just got o1. And when he says reasoning, he's talking about test-time compute: being able to give these models plenty of time to think through a problem and then come up with a solution. This is the next generation of model, the next generation of AI technology. Let's keep watching.

Actually, we're not going to launch... not launch, we're going to announce two models today: o3 and o3-mini. o3 is a very, very smart model. o3-mini is an incredibly smart model, and really good on performance and cost. So to get the bad news out of the way first, we're not going to publicly launch these today. The good news is we're going to make them available for public safety testing starting today; you can apply.

So like he said, we're not going to get the actual models, we're not going to be able to play around with them yet, and as soon as we can, you know I'm going to do a full test of it. But we do get to see the benchmarks, and they did benchmark it against the ARC Prize, and the results are stunning. Not only the ARC Prize: of course they benchmarked it on math, coding, reasoning, everything, and it is leaps and bounds better than anything else we've seen.

We'll talk more at the end about when we expect to make these models generally available, but we're so excited to show you what they can do and to talk about their performance. We've got a little surprise, we'll show you some demos, and without further ado, I'll hand it over to Mark to talk about it.

Cool, thank you so much, Sam. My name is Mark, I lead research at OpenAI, and I want to talk a little bit about o3's capabilities. Now, o3 is a really strong model on very hard technical benchmarks, and I want to start with the coding benchmarks, if you can bring those up. On software-style benchmarks we have SWE-bench Verified, which is a benchmark consisting of real-world software tasks.

All right, so first of all, here are the benchmarks. You can see o1-preview, then o1, and then o3 at 71.7%. SWE-bench is the best coding benchmark out there: real-world coding tasks, and o3 can accomplish them at a 71.7% rate. That is far better than anything else out there, even far better than o1. Then on the right you can see the competition code benchmark, with the Elo score on the y-axis: o1-preview, o1, and all the way up here, o3. Just absolutely outstanding performance.
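A quick aside while we're here. I said "reasoning" means test-time compute, so what does spending more compute at inference time actually look like? OpenAI has not published how o3 spends its thinking time, but one well-known, published flavor of test-time compute is self-consistency: sample several independent reasoning chains and majority-vote the final answers. Here is a minimal sketch of that idea; `ask_model` is a placeholder callable, and none of this is a claim about o3's internal mechanism.

```python
from collections import Counter
from typing import Callable

def self_consistency_answer(ask_model: Callable[[str], str],
                            question: str,
                            n_samples: int = 8) -> tuple[str, float]:
    """Toy test-time-compute loop: more samples means more compute spent.

    ask_model is assumed to sample one full reasoning chain and return
    only the final answer string (placeholder; o3's method is not public).
    """
    answers = [ask_model(question) for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]  # majority vote
    return best, votes / n_samples  # the answer plus its vote share

# Usage: self_consistency_answer(my_model_fn, "What is 17 * 24?", n_samples=16)
```

The point of the sketch is that accuracy becomes a dial you turn with `n_samples`, which is exactly the low/medium/high trade-off the o3-mini segment gets into later. Okay, back to the stream.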
We're seeing that o3's performance is about 71.7% accuracy, which is over 20 points better than our o1 models, and this really signifies that we're climbing the frontier of utility. On competition code, we see that o1 achieves an Elo of about 1891 on the contest coding site Codeforces; at our most aggressive high test-time-compute settings, we're able to achieve an Elo of about 2727. So, Mark was a competitive programmer, and actually still coaches competitive programming. Very good. What's your best? I think my best on a comparable site was about 2500. That's tough.

Well, okay, so let's just break down what just happened, and I'm going to reference AGI one more time. First, what is the definition of AGI, according to at least Sam Altman and OpenAI? AGI is AI that outperforms humans at most economically valuable work. Now, this person Mark, the head of research at OpenAI and also a competitive coder, was beaten by o3 on the competition code benchmark. If that is not AGI, at least on this dimension, I don't know what is. And think about this: in chess, the best person in the world, Magnus Carlsen, has an Elo rating of 2831, while the best AI chess engines are rated 3700 and above. So AI has now exceeded some of our best minds in these different fields, chess and coding. Okay, so that is proof point number one that AGI has been achieved. And I want to mention one other thing: OpenAI can't technically say AGI has been achieved, or rather they can, but here's the problem: as soon as they define one of their systems as AGI, Microsoft no longer gets access to it, and that is in their agreement with Microsoft.
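Since we're throwing around Elo numbers from two different games, it helps to remember that an Elo gap translates directly into an expected head-to-head score. A quick back-of-the-envelope check with the standard Elo formula, using the numbers quoted above (nothing here is o3-specific):

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# o3's quoted ~2727 Codeforces Elo against Mark's best of ~2500:
print(round(elo_expected_score(2727, 2500), 2))   # 0.79 expected score
# A ~3700-rated chess engine against Magnus Carlsen at 2831:
print(round(elo_expected_score(3700, 2831), 3))   # 0.993 expected score
```

So a 227-point gap already means roughly a four-to-one edge, and the engine-versus-Carlsen gap is close to total domination. That's the scale of the jump from 1891 to 2727.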
So, let's keep watching. I will say, you know, our chief scientist... this is also better than our chief scientist Jakub's score. I think there's one guy at OpenAI who's still at 3000-something. Yeah, a few more months; hopefully we have a couple of months to enjoy it. I mean, this model is incredible at programming. And not just programming, but also mathematics: on competition math benchmarks, just like competitive programming, we achieve very, very strong scores.

Okay, let's pause again. Competition math benchmark: 96.7%, almost a flawless score. This is a math machine. Then on the right we have PhD-level science questions, nearly 10 points higher than o1-preview and o1. These are massive gains for this new frontier model o3.

So o3 gets about 96.7% accuracy versus an o1 performance of 83.3% on AIME. What's your best AIME score? I did get a perfect score once, so I'm safe... but yeah. Okay, that's super impressive; he got a perfect score, wow. All right, so o3 is nearly beating Mark, the head of research at OpenAI. Really, what this signifies is that o3 often just misses one question whenever we test it on this very hard feeder exam for the USA Mathematical Olympiad. There's another very tough benchmark called GPQA Diamond, which measures the model's performance on PhD-level science questions, and here we get another state-of-the-art number, 87.7%, which is about 10 points better than our o1 performance of 78%.

So I'm going to talk about why these benchmarks in particular are so important, especially the PhD-level science benchmark. One of the requirements for starting the intelligence explosion, as per the Situational Awareness paper, is having AI that can do its own AI research and self-improvement. If you have a model that is at the frontier of math and the frontier of science, and it is able to actually discover new science, discover new math, and then apply those new discoveries to itself, that is the literal definition of what is required to hit the intelligence explosion: automated AI research. You take an o3 model, clone it a million times, and just let it run and self-improve indefinitely. Crazy to think about.

Just to put this in perspective, if you take an expert PhD, they typically get about 70% in their field of strength here. So one thing that you might notice from some of these benchmarks is that we're reaching saturation for a lot of them, or nearing saturation, and the last year has really highlighted the need for harder benchmarks to accurately assess where our frontier models lie.

Where do you think he's going with this? I already hinted at it at the beginning of the video, but he's right: these benchmarks are becoming saturated, basically beaten or close to it, and that's why it's important to have a very hard benchmark, something that truly tests whether it's AGI or not. Let's keep watching.

I think a couple have emerged as fairly promising over the last months, and one in particular I want to call out is Epoch AI's FrontierMath benchmark. Now, you can see the scores look a lot lower than they did for the previous benchmarks we showed, and this is because this is considered today the toughest mathematical benchmark out there. This is a dataset that consists of novel, unpublished, and very hard to extremely hard problems; even Terence Tao says it would take professional mathematicians hours or even days to solve one of these problems. Today, all offerings out there have less than 2% accuracy on this benchmark, and we're seeing that with o3, in aggressive test-time settings, we're able to get over 25%. That's more than a 10x improvement over everything that's come before it. This is wild.

In addition to Epoch AI's FrontierMath benchmark, we have one more surprise for you guys. I want to talk about the ARC benchmark at this point, and I would love to invite one of our friends, Greg, who is the president of the ARC Prize Foundation, on to talk about this benchmark. Wonderful. Sam and Mark, thank you very much for having us today. Of course.

Awesome: Greg Kamradt, very cool to see him on here. I've talked to him a bunch; he's such a cool, nice guy, and yeah, he's running the ARC benchmark. Let's see what they do with the ARC benchmark.
Hello, everybody. My name is Greg Kamradt, and I am the president of the ARC Prize Foundation. Now, ARC Prize is a nonprofit with the mission of being a North Star towards AGI through enduring benchmarks. Our first benchmark, ARC-AGI, was developed in 2019 by François Chollet in his paper "On the Measure of Intelligence," and it has been unbeaten for five years now; in the AI world, that feels like centuries. So the system that beats ARC-AGI is going to be an important milestone towards general intelligence, and I'm excited to say today that we have a new state-of-the-art score to announce.

Before I get into that, though, I want to talk about what ARC-AGI is, and I would love to show you an example. ARC-AGI is all about input examples and output examples. (They're good, they're good, okay.) Input examples and output examples: the goal is to understand the rule of the transformation and apply it to the output. So, Sam, what do you think is happening in here? Probably putting a dark blue square in the empty space? Yes, that is exactly it. It's easy for humans to intuitively guess the rule; it's actually surprisingly hard for AI to understand what's going on. So I want to show one more, harder example. Now, Mark, I'm going to put you on the spot: what do you think is going on in this task? Okay, so you take each of these yellow squares, you count the number of colored squares inside, and you create a border of that color? That is exactly it, and that's much quicker than most people, so congratulations on that. What's interesting, though, is that AI has not been able to solve this problem thus far, even though we verified that a panel of humans could actually do it.

So I'm going to pause for a second just to reiterate what Greg already said. The reason the ARC benchmark, the ARC Prize, is so cool and so important is that it is easy for most humans to solve, and simultaneously very, very difficult for AI. And the reason this is maybe the best test of AGI is that it requires learning: taking what you've learned from one example and applying it to another example, both of which have never been seen in the core training set. That's the important part: the model has to infer the solution to a new problem based on something it has seen in a previous prompt or previous piece of data. Let's keep watching.

Now, the unique part about ARC-AGI is that every task requires distinct skills. What I mean by that is there won't be another task where you need to fill in the corners with blue squares. We do that on purpose, because we want to test the model's ability to learn new skills on the fly; we don't just want it to repeat what it has already memorized. That's the whole point here. Now, ARC-AGI version one took five years to go from 0% to 5% with leading frontier models. However, today I'm very excited to say that o3 has scored a new state-of-the-art score that we have verified: on low compute, o3 scored 75.7% on ARC-AGI's semi-private holdout set. This is extremely impressive because it is within the compute requirements we have for our public leaderboard, and it is the new number-one entry on ARC-AGI-Pub. So congratulations on that. Thank you so much.
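To make those grid puzzles concrete: public ARC-AGI tasks are distributed as JSON, with "train" and "test" pairs where each grid is a small matrix of color indices. Below is a toy, hand-made stand-in for the "put a dark blue square in the empty space" style of task Sam guessed, plus the check any candidate rule has to pass: it must reproduce every training pair exactly before you trust it on the test input. (The two-by-two grids and the rule are illustrative, not a real ARC task; in the usual palette, 0 renders as black/empty and 1 as blue.)

```python
# A miniature ARC-style task in the public JSON shape (hand-made example).
task = {
    "train": [
        {"input": [[3, 0], [3, 3]], "output": [[3, 1], [3, 3]]},
        {"input": [[0, 3], [3, 3]], "output": [[1, 3], [3, 3]]},
    ],
    "test": [{"input": [[3, 3], [0, 3]]}],
}

def candidate_rule(grid):
    """Guessed transformation: fill every empty (0) cell with dark blue (1)."""
    return [[1 if cell == 0 else cell for cell in row] for row in grid]

def fits_training_pairs(rule, task):
    """A rule is only credible if it reproduces every train output exactly."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

if fits_training_pairs(candidate_rule, task):
    print(candidate_rule(task["test"][0]["input"]))  # [[3, 3], [1, 3]]
```

The structure is the whole trick: the rule has to be induced from a couple of examples and then applied to a grid the solver has never seen, which is exactly the on-the-fly skill acquisition Greg is describing. Back to the announcement.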
Now, as a capabilities demonstration, when we ask o3 to think longer and actually ramp up to high compute, o3 was able to score 85.7%, sorry, 87.5%, on the same hidden holdout set. This is especially important because human performance is comparable, at an 85% threshold, so being above this is a major milestone; we have never tested a system, or any model, that has done this before.

Okay, this is proof point number two for why this is AGI, and by the way, if you disagree, leave a comment below and tell me why, because this is the clearest proof. This is not clickbait; this is true. If 85% is the human average, and now o3 beats that, and essentially beats PhD-level science and expert-level math, what is left for the definition of AGI? Let's keep watching.

So this is new territory in the ARC-AGI world. Congratulations on that. And congratulations on making such a great benchmark. You know, when I look at these scores, I realize I need to switch my worldview a little bit; I need to fix my AI intuitions about what AI can actually do and what it's capable of, especially in this o3 world. But the work is also not over yet, and these are still the early days of AI, so we need more enduring benchmarks like ARC-AGI to help measure and guide progress. I am excited to accelerate that progress, and I'm excited to partner with OpenAI next year to develop our next frontier benchmark. Amazing. You know, it's also a benchmark that we've been targeting, that's been on our mind, for a very long time, so we're excited to work with you in the future. Worth mentioning: we do target it, and we think it's an awesome benchmark, but we didn't go do anything specific for it; this is just the general o3.

Yeah, I'm glad they cleared that up. They did not train on the ARC Prize; they did not go out and specifically research the ARC Prize and how to make o3 work with it. They just made o3, and it just happened to do really well on the ARC Prize.

Really appreciate the partnership, and this was a fun one to do. Absolutely. And even though this has done so well, ARC Prize will continue in 2025, and anybody can find out more at arcprize.org.
Great, thank you. And by the way, there is a million-dollar prize for the ARC Prize, and it seems like that might go to OpenAI. I kind of hope it goes to a small team, but it is what it is.

Okay, so next up we're going to talk about o3-mini. o3-mini is a thing that we're really, really excited about, and Hongyu, who trained the model, will come out and join us. So not only are we getting o3, we also get the mini version, which is going to be cheaper, faster, and probably nearly as performant.

Hi everyone, I'm Hongyu Ren, an OpenAI researcher working on reasoning. So this September we released o1-mini, an efficient reasoning model in the o1 family that's really capable at math and coding, probably one of the best in the world given the low cost. So now, together with o3, I'm very happy to tell you more about o3-mini, a brand-new model in the o3 family that truly defines a new cost-efficient reasoning frontier. It's incredible. Though it's not available to our users today, we are opening access to the model to our safety and security researchers to test it out. With the release of adaptive thinking time in the API a couple of days ago, o3-mini will support three different options, low, medium, and high reasoning effort, so users can freely adjust the thinking time based on their different use cases.

And that's really cool: if you already know your use case, if you already know whether this is more of an easier problem or a harder problem, you can adjust the setting accordingly and probably save a lot of money by doing so. Let's keep watching.

So, for example, we may want the model to think longer for more complicated problems and think shorter on simpler ones. With that, I'm happy to show the first set of evals for o3-mini. On the left-hand side we show the coding eval, the Codeforces Elo, which measures how good a programmer is; higher is better. As we can see on the plot, with more thinking time, o3-mini achieves increasing Elo, outperforming o1-mini across the board, and with medium thinking time it does even better than o1. So for an order of magnitude improvement in speed and cost, we can deliver the same code performance on this eval. And although o3-mini high is still a couple of hundred points away from Mark, it's not far. That's better than me, probably. But it's just an incredible cost-to-performance gain over what we've been able to offer with o1, and we think people will really love this. I hope so. On the right-hand plot we show the estimated cost versus Codeforces Elo trade-off, and it's pretty clear that o3-mini defines a new cost-efficient reasoning frontier on coding: it achieves better performance than o1 at a fraction of the cost.

Amazing. With that being said, I would like to do a live demo of o3-mini. All right, so we get to see it; let's see how it goes. So I'm testing out o3-mini high first, and the task is asking the model to use Python to implement a code generator and executor. If I run this Python script, it will launch a server locally, with a UI that contains a text box, and then we can make coding requests in the text box. It will send the request to the o3-mini API, the o3-mini API will solve the task and return a piece of code, and it will then save the code locally on my desktop and open a terminal to execute the code automatically.
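For a sense of what that harness involves, here is a stripped-down reconstruction in Python. To be clear, this is my sketch, not OpenAI's demo code: the "o3-mini" model name and the reasoning_effort parameter are taken from what's described in the segment, the prompt is invented, and executing model-written code like this should only ever happen inside a sandbox.

```python
import subprocess
import sys
import tempfile

from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_and_run(request: str) -> str:
    """Ask the model for a standalone script, save it locally, execute it.

    A sketch of the demo's generator/executor flow; the model name and
    reasoning_effort values are the knobs described in the announcement.
    """
    resp = client.chat.completions.create(
        model="o3-mini",          # assumed name, per the announcement
        reasoning_effort="high",  # low | medium | high thinking time
        messages=[{
            "role": "user",
            "content": "Reply with only a runnable Python script, no prose "
                       "and no code fences. Task: " + request,
        }],
    )
    code = resp.choices[0].message.content
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)  # the demo saved the generated script to the desktop
        path = f.name
    # DANGER: only run model-generated code inside a sandbox or container.
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True, timeout=120)
    return result.stdout

# e.g. print(generate_and_run("Print 'OpenAI' and a random number."))
```

Anyway, back to the demo.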
So it's a pretty complicated task, right? And it also output a pretty big chunk of code. So if we copy the code, paste it over to our server, and then launch this server, we should get a text box. Yeah, okay, great. Oh yeah, I see, it seems to be launching something. Okay, great: we have a UI where we can enter some coding prompts. Let's try out a simple one, like print "OpenAI" and a random number. Submit. So it's sending the request to o3-mini medium, so it should be pretty fast.

All right, so he used o3-mini to write code that then hits the o3-mini API separately. So meta, so cool.

There in the terminal: yeah, 41, that's the random number, right. So it saves the generated code to a local script on the desktop and prints out "OpenAI" and the number. Is there any other task you guys want to test out? I wonder if you could get it to get its own GPQA numbers. That's exactly... that's a great ask, just what I expected; we practiced this a lot yesterday. Okay, so now let me copy the code and send it in the code UI. So in this task, we ask the model to evaluate o3-mini with the low reasoning effort on this hard GPQA dataset. The model needs to first download the raw file from this URL, then figure out which part is the question, which part is the answer, and which part is the options; then formulate all the questions, ask the model to answer them, parse the results, and grade them. That's actually blazingly fast. Yeah, it's actually really fast because it's calling o3-mini with low reasoning effort. Let's see how it goes; I guess these two tasks are really hard. Yeah, that's the problem... nothing like a live demo. While we're waiting for this, do you want to show what the request was again? Oh, it actually returned the results: it's, uh, 61.
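For readers who want the shape of that self-eval, here is a minimal sketch of the grading loop the generated script would need. Everything concrete in it is a placeholder: the record fields, the prompt format, and the answer parsing are my assumptions, since the segment never shows the actual GPQA file or the generated code.

```python
import random

from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()

def grade_multiple_choice(records: list[dict]) -> float:
    """Score the model on A/B/C/D questions, as in the GPQA demo.

    Each record is assumed to look like
    {"question": str, "answer": str, "distractors": [str, str, str]};
    the real GPQA field names and file format differ.
    """
    letters = "ABCD"
    correct = 0
    for rec in records:
        options = [rec["answer"]] + rec["distractors"]
        random.shuffle(options)  # don't leak the answer through its position
        prompt = (rec["question"] + "\n"
                  + "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(options))
                  + "\nAnswer with a single letter.")
        resp = client.chat.completions.create(
            model="o3-mini",         # assumed name, per the announcement
            reasoning_effort="low",  # the demo's fast, cheap setting
            messages=[{"role": "user", "content": prompt}],
        )
        guess = resp.choices[0].message.content.strip()[:1].upper()
        if guess in letters and options[letters.index(guess)] == rec["answer"]:
            correct += 1
    return correct / len(records)  # fraction correct; the demo printed ~61%
```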