Unknown


Video Transcript:

[Music]

Good morning. We have an exciting one for you today. We started this 12-day event 12 days ago with the launch of o1, our first reasoning model. It's been amazing to see what people are doing with that, and very gratifying to hear how much people like it. We view this as sort of the beginning of the next phase of AI, where you can use these models to do increasingly complex tasks that require a lot of reasoning. So for the last day of this event, we thought it would be fun to go from one frontier model to our next frontier model.

Today we're going to talk about that next frontier model, which you would think logically maybe should be called o2. But out of respect to our friends at Telefónica, and in the grand tradition of OpenAI being really, truly bad at names, it's going to be called o3. Actually, we're going to launch, not launch, we're going to announce two models today: o3 and o3-mini. o3 is a very, very smart model. o3-mini is an incredibly smart model too, but with really good performance and cost.

So to get the bad news out of the way first: we're not going to publicly launch these today. The good news is we're going to make them available for public safety testing starting today. You can apply, and we'll talk about that later. We've taken safety testing seriously as our models get more and more capable, and at this new level of capability we want to try adding a new part of our safety testing procedure, which is to allow public access for researchers that want to help us test. We'll talk more at the end about when we expect to make these models generally available. But we're so excited to show you what they can do and to talk about their performance. We've got a little surprise, we'll show you some demos, and without further ado I'll hand it over to Mark to talk about it.

Cool, thank you so much, Sam. So my name is Mark, I lead research at OpenAI, and I want to talk a little bit about o3's capabilities. Now, o3 is a really strong model
at very hard technical benchmarks, and I want to start with coding benchmarks, if you can bring those up. So on software-style benchmarks, we have SWE-bench Verified, which is a benchmark consisting of real-world software tasks. We're seeing that o3 performs at about 71.7% accuracy, which is over 20% better than our o1 models. Now, this really signifies that we're really climbing the frontier of utility. As well, on competition code, we see that o1 achieves an Elo on this contest coding site called Codeforces of about 1891. At our most aggressive high test-time compute settings, we're able to achieve almost like a 2727 Elo here.

So Mark was a competitive programmer, and actually still coaches competitive programming. Very, very good. What is your...?

I think my best at a comparable site was about 2500.

That's tough.

Well, I will say, you know, this is also better than our chief scientist Jakub's score. I think there's one guy at OpenAI who's still like a 3,000-something.

Yeah, a few more months to enjoy.

Hopefully we have a couple months to enjoy there.

Great. I mean, this model is incredible at programming.

Yeah, and not just programming but also mathematics. So we see that on competition math benchmarks, just like competitive programming, we achieve very, very strong scores. o3 gets about 96.7% accuracy versus an o1 performance of 83.3% on the AIME.

What's your best AIME score?

I did get a perfect score once, so I'm safe. But yeah, really what this signifies is that o3 often just misses one question whenever we test it on this very hard feeder exam for the USA Mathematical Olympiad. There's another very tough benchmark, which is called GPQA Diamond, and this measures the model's performance on PhD-level science questions. Here we get another state-of-the-art number, 87.7%, which is about 10% better than our o1 performance, which was at 78%. Just to put this in perspective, if you take an expert PhD, they typically get about 70% in kind of their field of strength here.

So one thing that you might notice from some of these benchmarks is that we're reaching saturation for a lot of them, or nearing saturation. So the last year has really highlighted the need for really harder benchmarks to accurately assess where our frontier models lie, and I think a couple have emerged as fairly promising over the last months. One in particular I want to call out is Epoch AI's FrontierMath benchmark. Now, you can see the scores look a lot lower than they did for the previous benchmarks we showed, and this is because this is considered today the toughest mathematical benchmark out there. This is a dataset that consists of novel, unpublished, and also very hard to extremely hard...

Yeah, very, very hard problems.

Even Terence Tao says, you know, it would take professional mathematicians hours or even days to solve one of these problems. And today all offerings out there have less than 2% accuracy on this benchmark, and we're seeing with o3, in aggressive test-time settings, we're able to get over 25%.

Yeah, that's awesome.

In addition to Epoch AI's FrontierMath benchmark, we have one more surprise for you guys. So I want to talk about the ARC benchmark at this point, but I would love to invite one of our friends, Greg, who is the president of the ARC Foundation, on to talk about this benchmark.

Wonderful. Sam and Mark, thank you very much
for having us today.

Of course.

Hello everybody, my name is Greg Kamradt and I'm the president of the ARC Prize Foundation. Now, ARC Prize is a nonprofit with the mission of being a north star towards AGI through enduring benchmarks. So our first benchmark, ARC-AGI, was developed in 2019 by François Chollet in his paper "On the Measure of Intelligence." However, it has been unbeaten for 5 years now. In the AI world, that feels like centuries. So the system that beats ARC-AGI is going to be an important milestone towards general intelligence. But I'm excited to say today that we have a new state-of-the-art score to announce. Before I get into that, though, I want to talk about what ARC-AGI is, so I would love to show you an example here.

ARC-AGI is all about having input examples and output examples. Well, they're good, they're good. Okay, input examples and output examples. Now, the goal is you want to understand the rule of the transformation and guess it on the output. So Sam, what do you think is happening in here?

Probably putting a dark blue square in the empty space.

Yes, that is exactly it. Now, it's easy for humans to intuit what that is; it's actually surprisingly hard for AI to understand what's going on. So I want to show one more, harder example here. Now Mark, I'm going to put you on the spot: what do you think is going on in this task?

Okay, so you take each of these yellow squares, you count the number of colored squares there, and you create a border with that.

That is exactly it, and that's much quicker than most people, so congratulations on that. What's interesting, though, is that AI has not been able to get this problem thus far, even though we verified that a panel of humans could actually do it. Now, the unique part about ARC-AGI is that every task requires distinct skills. What I mean by that is there won't be another task where you need to fill in the corners with blue squares. We do that on purpose.
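To make the task format concrete, here is a minimal Python sketch of a puzzle in the spirit of the first example (filling the empty space with a dark blue square). The grid encoding, the color codes, and the helper names `fill_empty` and `rule_fits` are assumptions made for illustration; they are not taken from the talk or from any ARC Prize tooling. The idea it shows is simply that a candidate rule is accepted only if it reproduces every demonstration pair, and is then applied to a new input.

```python
# Illustrative only: the grid encoding and color codes are assumptions,
# not taken from the talk. 0 = empty cell, 1 = dark blue, 2 = another color.
EMPTY, DARK_BLUE = 0, 1


def fill_empty(grid, color=DARK_BLUE):
    """Candidate rule: paint every empty cell with `color`."""
    return [[color if cell == EMPTY else cell for cell in row] for row in grid]


def rule_fits(rule, examples):
    """Accept a rule only if it reproduces every demonstration output."""
    return all(rule(inp) == out for inp, out in examples)


# Two hypothetical demonstration pairs: (input grid, expected output grid).
examples = [
    ([[2, 2, 2], [2, 0, 2], [2, 2, 2]],
     [[2, 2, 2], [2, 1, 2], [2, 2, 2]]),
    ([[2, 0], [2, 2]],
     [[2, 1], [2, 2]]),
]

# A new input the rule has not seen; the solver must produce its output.
test_input = [[0, 2, 2], [2, 2, 2]]

prediction = fill_empty(test_input) if rule_fits(fill_empty, examples) else None
```

A hand-written rule like this only covers one puzzle. Because each task calls for a different rule, a solver has to come up with the rule itself from a handful of examples.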
And the reason why we do that is because we want to test the model's ability to learn new skills on the fly. We don't just want it to repeat what it's already memorized. That's the whole point here. Now, ARC-AGI version 1 took 5 years to go from 0% to 5% with leading frontier models. However, today I'm very excited to say that o3 has scored a new state-of-the-art score that we have verified. On low compute, o3 has scored 75.7% on ARC-AGI's semi-private holdout set. Now, this is extremely impressive because this is within the compute requirement that we have for our public leaderboard, and this is the new number one entry on ARC-AGI-Pub. So congratulations on that.

Thank you so much.

Yeah. Now, as a capabilities demonstration, when we ask o3 to think longer and we actually ramp up to high compute, o3 was able to score 85.7% on the same hidden holdout set. This is especially important... 87.5. Sorry, 87.5, yes. This is especially important because human performance is comparable at the 85% threshold, so being above this is a major milestone, and we have never tested a system that has done this, or any model that has done this, beforehand. So this is new territory in the ARC-AGI world. Congratulations on that.

Congratulations for making such a great benchmark.

Yeah. When I look at these scores, I realize I need to switch my worldview a little bit. I need to fix my AI intuitions about what AI can actually do and what it's capable of, especially in this o3 world. But the work also is not over yet, and these are still the early days of AI. So we need more enduring benchmarks like ARC-AGI to help measure and guide progress, and I am excited to accelerate that progress. And I'm excited to partner with OpenAI next year to develop our next frontier benchmark.

Amazing. You know, it's also a benchmark that we've been targeting, and it's been on our mind for a very long time, so we're excited to work with you in the future.

Worth mentioning that we didn't... we target it and we think it's an awesome benchmark, but we didn't go do specific... you know, the general... But yeah, really appreciate the partnership. This was a fun one to do.

Absolutely. And even though this has done so well, ARC Prize will continue in 2025, and anybody can find out more at arcprize.org.

Great, thank you so much.

Absolutely.

Okay, so next up we're going to talk about o3-mini. o3-mini is a thing that we're really, really excited about, and Hongyu, who trained the model, will come out and join us.

Hey. Hey, you.

Hi everyone, I'm Hongyu Ren. I'm an OpenAI researcher working on reasoning. So this September we released o1-mini, which is an efficient reasoning model in the o1 family that's really capable at math and coding, probably among the best in the world given the low cost. So now, together with o3, I'm very happy to tell you more about o3-mini, which is a brand new model in the o3 family that truly defines a new cost-efficient reasoning frontier.

It's incredible.

Yeah. Though it's not available to our users today, we are opening access to the model to our safety and security researchers to test the model out. With the release of adaptive thinking time in the API a couple days ago, o3-mini will support three different options: low, medium, and high reasoning effort. So users can freely adjust the thinking time based on their different use cases. For example, we may want the model to think longer for more complicated problems and think shorter with simpler ones. With that, I'm happy to show the first set of evals of o3-mini.

So on the left-hand side we show the coding evals. It's Codeforces Elo, which measures how good a programmer is, and higher is better. As we can see on the plot, with more thinking time o3-mini is able to have increasing Elo, all outperforming o1-mini, and with medium thinking time it's able to measure even better than o1.

Yeah, so it's like, for an order of magnitude more speed and cost, we can deliver the same code performance on this, or even better.

Right. So although o3-mini high is still like a couple hundred points away from Mark, it's not far.

That's better than me, probably. But just an incredible sort of cost to
performance gain over what we've been able to offer with o1, and we think people will really love this.

Yeah, I hope so. So on the right-hand plot we show the estimated cost versus Codeforces Elo tradeoff. It's pretty clear that o3-mini defines a new cost-efficient reasoning frontier on coding. It achieves better performance than o1 at a fraction of the cost.

Amazing.

With that being said, I would like to do a live demo on o3-mini, and hopefully you can test out all the three different low, medium, and high thinking times of the model. So let me paste the problem. I'm testing out o3-mini high first, and the task is asking the model to use Python to implement a code generator and executor. If I run this Python script, it will launch a server locally with a UI that contains a text box, and then we can make coding requests in the text box. It will send the request to call the o3-mini API, and the o3-mini API will solve the task and return a piece of code. It will then save the code locally on my desktop and then open a terminal to execute the code automatically. So it's a pretty complicated task, right? And it outputs a big chunk of code. So if we copy the code and paste it to our server, and then we launch this server, we should get a text box when launching it.

Yeah, okay, great.

Oh yeah, I see. It seems to be launching something. Okay, oh great, we have a UI where we can enter some coding prompts. Let's try out a simple one, like: print OpenAI and a random number. Submit. So it's sending the request to o3-mini medium, so it should be pretty fast, right? So on this terminal... yeah, 41. That's the magic number, right? So it saves the generated code to this local script on the desktop and prints out OpenAI and 41. Is there any other task you guys want to test out?

I wonder if you could get it to get its own GPQA numbers.
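The core loop of the demo described above (request code from a model, save it to a local file, run it) can be sketched roughly as follows. This is not the script shown in the talk, which is not reproduced in the transcript. The model call is replaced by a hypothetical stub, `fake_model`, the function and file names are made up for illustration, and the web UI and terminal window are left out.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def fake_model(prompt: str, reasoning_effort: str = "medium") -> str:
    """Hypothetical stand-in for the model API call; returns Python source."""
    # A real script would send `prompt` and the effort level to the API here.
    # The stub returns a fixed program instead of a truly random number.
    return 'print("OpenAI", 41)\n'


def generate_and_run(prompt: str, workdir: Path, effort: str = "medium") -> str:
    """Ask the (stubbed) model for code, save it to disk, run it, return stdout."""
    code = fake_model(prompt, reasoning_effort=effort)
    script = workdir / "generated.py"
    script.write_text(code, encoding="utf-8")
    # Run the saved script in a separate interpreter process with a timeout.
    # Model-generated code is untrusted, so a real tool should sandbox this step.
    result = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True,
        text=True,
        timeout=30,
        check=True,
    )
    return result.stdout


with tempfile.TemporaryDirectory() as tmp:
    output = generate_and_run("print OpenAI and a random number", Path(tmp))
```

Running the generated file in a subprocess keeps the generated program out of the host script's interpreter, and the timeout gives the host a way to stop a program that hangs.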
That is... that's a great ask, just as I expected. We practiced a lot yesterday. Okay, so now let me copy the code and send it in the code UI. In this task, we asked the model to evaluate o3-mini with low reasoning effort on this hard GPQA dataset. The model needs to first download the raw file from this URL, and then it needs to figure out which part is the question, which part is the answer, and which part is the options, right? And then formulate all the questions, ask the model to answer them, parse the results, and then grade them.

That's actually blazingly fast.

Yeah, it's actually really fast because it's calling o3-mini with low reasoning effort. Let's see how it goes. I guess two tasks are really hard here.

Yeah, the long tail of the problems. Go, go.

Yeah, GPQA is a hard dataset.

Yes. It contains maybe 196 easy problems and two really hard problems. While we're waiting for this, do you want to show what the request was again?

Mhm. Oh, it actually returns the results. It's 61.