OpenAI just dropped their o3 announcement on Friday, December 20th. It was the twelfth day of their 12 Days of Christmas announcements, and I'm beyond delighted about what this means for all of us in the AI space. This isn't just another incremental update; we're witnessing a fundamental shift in what AI can do. Listen carefully. When I ventured into AI in January 2023, as someone who ran a 100-person writing agency with over 60,000 completed client projects, I was skeptical. I knew AI was just detecting, generating, predicting, not originating, not creating, not innovating. But o3? This changes everything. We're looking at AGI. We're looking at a model that doesn't just match human performance, it exceeds it. I'm talking 87.5% on the ARC-AGI benchmark, where humans score 85%; 87.7% on PhD-level science questions, where human experts typically score 70%; a 2727 Elo rating in competitive programming. In October this year, when I started First Movers, I talked with owners of hundred-million-dollar companies. They were taking their teams from 85 people down to two, assisted by agentic AI processes. This wasn't exaggeration; it was a preview of what's coming. It's bloodbath times, folks, and o3 is about to accelerate this transformation exponentially. The gap between companies that adapt and those that don't is about to become a chasm. To everyone who thought I was exaggerating when I talked about the AI revolution: this is why I've been pushing so hard for adaptation, this is why I launched First Movers, this is why I've been saying we need to get ahead of what's coming. The wealth transfer to the adapters isn't coming; it's happening now. Remember this: AI right now is the worst it'll ever be. Sit with that thought. The future's bright. Are you ready to become a first mover? Let's go. And now I want to share some important perspectives. First, we'll hear what YouTuber Matt Berman said; second, insights from my good friend Dave Shapiro; third, an explanation of the ARC benchmark; fourth, OpenAI's huge o3 announcement; and finally, a special message from me. Enjoy.

It is far better than anything else out there right now. Really, what AI is proving is the theory of multiple intelligences. These tools are already superhuman in that they can write faster than us, they can think faster than us, and many of them know more than we will ever know, just in terms of raw knowledge. They're better at writing. I made a Claude Style today that writes to a degree that I'm just not capable of, at least not without many, many iterations, and it can do it in one shot. So semantic intelligence and those sorts of things. Humans, with practice, are still better at intuition and large-scale pattern matching, and at maintaining coherence; Claude will sometimes become incoherent. But really what we're seeing, particularly with o3, is that we're solving reasoning and we're solving problem-solving. Taking a big step back, here's my personal understanding of intelligence: the universe is only so complex. Once you master math and physics and chemistry and coding, what else is there? Once you understand the basic building blocks of reality, there's not much else to learn. Yes, these things can be recombined in infinite combinations, and there are emergent characteristics, which means there's technically infinite complexity to deal with, but at the same time reality is very consistent. I guess the simplest way of saying it is that AI is getting very close to circumscribing reality. AI is getting very close to mastering reality. And it's not just "oh, well, it still has some gaps in what it can do"; sure, who cares? I even saw Beff Jezos (Guillaume Verdon) saying embodied AI is going to solve everything else. No, it's not. Embodiment data makes no difference in solving advanced physics or nuclear fusion or antimatter reactors or those sorts of things. Ducks can move through the world. Your cat has better embodied data than you do. Orangutans and chimps have better embodiment data than you do. Human embodiment is in no way the way to solve intelligence. So it's basically solving reality, and having a complete grasp of logic and reasoning and those sorts of things, and this is the kind of problem you only need to solve once. Now, that being said, there could be an arms race around the ability to handle increasingly complex problems or narratives or whatever. But who cares? At the end of the day, if you have plants growing in the dirt, if you have energy to use and minerals to use and you can recycle everything, then basically reality itself becomes the ground truth, and that is the shared sandbox that we're all in. AI is on the cusp of mastering the rules of the game that we're all playing better than all humans combined. Now let's just say, for instance, that o3 in aggregate is in the top 25th percentile of all humans in terms of all intellectual capabilities; it might be in the top 20th percentile, or the top decile, or whatever. In aggregate, there are still a billion humans that collectively can do way more than the AI can do. Okay, great. But then what happens when you scale it up, and what happens when this trend continues? So yeah, that's kind of my take on where we're at.

ARC is intended as a kind of IQ test for machine intelligence, and what makes it different from most benchmarks out there is that it's designed to be resistant to memorization. If you look at the way LLMs work, they are basically this big interpolative memory, and the way you scale up their capabilities is by trying to cram as much knowledge and as many patterns as possible into them. By contrast, ARC does not require a lot of knowledge at all. It's designed to require only what's known as core knowledge: basic knowledge about things like elementary physics, objectness, counting, that sort of thing; the sort of knowledge that any four- or five-year-old possesses. What's interesting is that each puzzle in ARC is novel, something you've probably not encountered before, even if you've memorized the entire Internet.
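For anyone who wants to poke at ARC-AGI directly: in the public dataset each task is a small JSON object with "train" and "test" lists of input/output grid pairs, where every grid is a 2-D array of integers 0 through 9 standing for colors. The toy task and the fill_gap rule below are invented for illustration (real tasks are far richer); only the overall format mirrors the published dataset.

```python
# A toy ARC-style task in the JSON-like format used by the public ARC-AGI
# dataset: "train" and "test" lists of input/output grid pairs, where each
# grid is a 2-D list of integers 0-9 (colors). These grids are made up.
toy_task = {
    "train": [
        {"input": [[1, 0], [1, 1]], "output": [[1, 1], [1, 1]]},
        {"input": [[0, 2], [2, 2]], "output": [[2, 2], [2, 2]]},
    ],
    "test": [
        {"input": [[3, 3], [0, 3]], "output": [[3, 3], [3, 3]]},
    ],
}

def solves_task(program, task) -> bool:
    """Check a candidate program (grid -> grid) against every pair in a task."""
    pairs = task["train"] + task["test"]
    return all(program(pair["input"]) == pair["output"] for pair in pairs)

def fill_gap(grid):
    """Hand-written rule for the toy task: paint the empty (0) cell with the
    single non-zero color used in the grid."""
    color = max(cell for row in grid for cell in row)
    return [[cell if cell != 0 else color for cell in row] for row in grid]

print(solves_task(fill_gap, toy_task))  # True for the toy task above
```

The checker makes the same point Chollet does: a solver has to infer each task's rule from its train pairs alone, because the rule is new every time.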
Good morning, we have an exciting one for you today. We started this twelve-day event twelve days ago with the launch of o1, our first reasoning model. It's been amazing to see what people are doing with it, and very gratifying to hear how much people like it. We view this as the beginning of the next phase of AI, where you can use these models to do increasingly complex tasks that require a lot of reasoning. So for the last day of this event, we thought it would be fun to go from one frontier model to our next frontier model. Today we're going to talk about that next frontier model, which you would think, logically, should maybe be called o2. But out of respect for our friends at Telefónica, and in the grand tradition of OpenAI being really, truly bad at names, it's going to be called o3. Actually, we're not going to launch today; we're going to announce two models: o3 and o3-mini. o3 is a very, very smart model; o3-mini is an incredibly smart model too, and a really good one on performance and cost. To get the bad news out of the way first, we're not going to publicly launch these today. The good news is that we're going to make them available for public safety testing starting today; you can apply, and we'll talk about that later. We've taken safety testing seriously as our models get more and more capable, and at this new level of capability we want to try adding a new part to our safety testing procedure, which is to allow public access for researchers who want to help us test. We'll talk more at the end about when we expect to make these models generally available, but we're so excited to show you what they can do and to talk about their performance. We've got a little surprise, and we'll show you some demos. Without further ado, I'll hand it over to Mark.

Thank you so much, Sam. My name is Mark, I lead research at OpenAI, and I want to talk a little bit about o3's capabilities. o3 is a really strong model on very hard technical benchmarks, and I want to start with the coding benchmarks. On software-style benchmarks we have SWE-bench Verified, a benchmark consisting of real-world software tasks, and we're seeing that o3 performs at about 71.7% accuracy, which is over 20% better than our o1 models. This signifies that we're really climbing the frontier of utility as well. On competition code, o1 achieves an Elo of about 1891 on the contest-coding site Codeforces; at our most aggressive high test-time-compute settings, o3 is able to achieve almost a 2727 Elo. Mark was a competitive programmer, and he actually still coaches competitive programming; he's very, very good. "What's your best?" I think my best on a comparable site was about 2500. That's tough. Well, I will say this is also better than our chief scientist Jakub's score; I think there's one person at OpenAI who is still at 3,000-something. Hopefully we have a couple more months to enjoy that. This model is incredible at programming.

And not just programming, but also mathematics. On competition-math benchmarks, just like competitive programming, we achieve very, very strong scores: o3 gets about 96.7% accuracy on AIME versus an o1 performance of 83.3%. "What's your best AIME score?" I did get a perfect score once. "So I'm safe." What this really signifies is that o3 often misses just one question whenever we test it on this very hard feeder exam for the USA Mathematical Olympiad. There's another very tough benchmark called GPQA Diamond, which measures a model's performance on PhD-level science questions. Here we get another state-of-the-art number:
87.7%, which is about 10% better than our o1 performance, which was at 78%. Just to put this in perspective, if you take an expert PhD, they typically get about 70% in their field of strength.

One thing you might notice from some of these benchmarks is that we're reaching saturation, or nearing saturation, on a lot of them. The last year has really highlighted the need for harder benchmarks to accurately assess where our frontier models stand, and I think a couple have emerged as fairly promising over the last months. One in particular I want to call out is Epoch AI's FrontierMath benchmark. You can see the scores look a lot lower than they did for the previous benchmarks we showed, and that's because this is considered the toughest mathematical benchmark out there today. It's a dataset of novel, unpublished, and extremely hard problems; even Terence Tao has described them as exceptionally challenging, and it would take professional mathematicians hours or even days to solve one of them. Today, all other offerings out there have less than 2% accuracy on this benchmark, and we're seeing that with o3, in aggressive test-time settings, we're able to get over 25%. In addition to Epoch AI's FrontierMath benchmark, we have one more surprise for you. I want to talk about the ARC benchmark at this point, and I would love to invite one of our friends, Greg, the president of the ARC Foundation, to talk about it.

Sam and Mark, thank you very much for having us today. Hello, everybody, my name is Greg Kamradt, and I'm the president of the ARC Prize Foundation. ARC Prize is a nonprofit with the mission of being a North Star towards AGI through enduring benchmarks. Our first benchmark, ARC-AGI, was developed in 2019 by François Chollet in his paper "On the Measure of Intelligence," and it has been unbeaten for five years now; in the AI world, that feels like centuries. So a system that beats ARC-AGI is going to be an important milestone towards general intelligence, and I'm excited to say that today we have a new state-of-the-art score to announce. Before I get into that, though, I want to talk about what ARC-AGI is, and I would love to show you an example. ARC-AGI is all about input examples and output examples: the goal is to understand the rule of the transformation and apply it to produce the output. So, Sam, what do you think is happening here? "Probably putting a dark blue square in the empty space." Yes, that is exactly it. It's easy for humans to intuitively guess what's going on; it's actually surprisingly hard for AI to understand it. I want to show one more, harder example. Mark, I'm going to put you on the spot: what do you think is going on in this task? "Okay, so you take each of these yellow squares, you count the number of colored squares there, and you create a border with that count." That is exactly it, and that's much quicker than most people, so congratulations. What's interesting, though, is that AI has not been able to solve this problem so far, even though we verified that a panel of humans could do it. Now, the unique part about ARC-AGI is that every task requires distinct skills. What I mean by that is that there won't be another task where you need to fill in the corners with blue squares. We do that on purpose, because we want to test the model's ability to learn new skills on the fly; we don't just want it to repeat what it's already memorized. That's the whole point here.

Now, ARC-AGI version 1 took five years to go from 0% to 5% with leading frontier models. However, today I'm very, very excited to say that o3 has scored a new state-of-the-art result that we have verified. On low compute, o3 scored 75.7% on ARC-AGI's semi-private holdout set. This is extremely impressive, because it's within the compute requirements we set for our public leaderboard, which makes it the new number-one entry on ARC-AGI-Pub. So congratulations on that. And as a capabilities demonstration, when we asked o3 to think longer and ramped it up to high compute, o3 was able to score 87.5% on the same hidden holdout set.
This is especially important because human performance is comparable, at an 85% threshold, so being above it is a major milestone. We have never tested a system, or any model, that has done this before, so this is new territory in the ARC-AGI world. Congratulations on that. And congratulations on making such a great benchmark. When I look at these scores, I realize I need to switch my worldview a little bit; I need to fix my AI intuitions about what AI can actually do and what it's capable of, especially in this o3 world. But the work is also not over yet, and these are still the early days of AI, so we need more enduring benchmarks like ARC-AGI to help measure and guide progress. I'm excited to accelerate that progress, and I'm excited to partner with OpenAI next year to develop our next frontier benchmark. Amazing. It's also a benchmark we've been targeting; it's been on our minds for a very long time, so we're excited to keep working with you. It's worth mentioning that we didn't do anything specific to target it; this is just the general o3. But we really appreciate the partnership, and this was a fun one to do. Absolutely. And even though o3 has done so well, ARC Prize will continue in 2025, and anybody can find out more at arcprize.org. Great, thank you so much.

Okay, so next up we're going to talk about o3-mini. o3-mini is something we're really, really excited about, and Hongyu, who trained the model, will come out and join us. Hi, everyone, I'm Hongyu Ren, an OpenAI researcher working on reasoning. This September we released o1-mini, an efficient reasoning model in the o1 family that's really capable at math and coding, probably one of the best in the world given its low cost. Now, together with o3, I'm very happy to tell you more about o3-mini, a brand-new model in the o3 family that truly defines a new cost-efficient reasoning frontier. Although it's not available to our users today, we are opening access to the model for our safety and security researchers through the test-model application. With the release of adaptive thinking time in the API a couple of days ago, o3-mini will support three different options, low, medium, and high reasoning effort, so users can freely adjust the thinking time based on their use cases: for example, we may want the model to think longer for more complicated problems and think shorter on simpler ones. With that, I'm happy to show the first set of evals for o3-mini. On the left-hand side we show the coding evals, Codeforces Elo, which measures how good a programmer is; higher is better. As you can see on the plot, with more thinking time o3-mini achieves increasing Elo, outperforming o1-mini, and with medium thinking time it matches or even beats o1. So for an order of magnitude better speed and cost, we can deliver the same code performance. And although o3-mini high is still a couple of hundred points away from Mark, it's not far, and it's probably better than me. It's just an incredible cost-to-performance gain over what we've been able to offer with o1, and we think people will really love this. I hope so. On the right-hand plot, we show the estimated cost versus Codeforces Elo trade-off.
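The low, medium, and high reasoning-effort options described here later surfaced as a per-request parameter in the API. Below is a minimal sketch of selecting it with the OpenAI Python SDK; the parameter name (reasoning_effort) and model id (o3-mini) are taken from how the feature shipped after this event, so verify them against current documentation before relying on this.

```python
# Minimal sketch of per-request reasoning effort with the OpenAI Python SDK.
# The "reasoning_effort" parameter and the "o3-mini" model id are assumptions
# based on how this feature shipped after the event; check current docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, effort: str = "medium") -> str:
    """Send one prompt, trading latency and cost against thinking time."""
    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,  # "low", "medium", or "high"
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("Give a one-line summary of what Codeforces Elo measures.", effort="low"))
print(ask("Prove that the square root of 2 is irrational.", effort="high"))
```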
It's pretty clear that o3-mini defines a new cost-efficient reasoning frontier on coding: it achieves better performance than o1 at a fraction of the cost. Amazing. With that said, I'd like to do a live demo of o3-mini, and hopefully show all three thinking-time settings, low, medium, and high. Let me paste the prompt. I'm testing o3-mini high first, and the task is that I'm asking the model to use Python to implement a code generator and executor. If I run this Python script, it will launch a server locally, with a UI that contains a text box, and we can make coding requests in that text box. It will send each request to the o3-mini API, the o3-mini API will solve the task and return a piece of code, the script will then save that code locally on my desktop, and then open a terminal to execute the code automatically. So it's a pretty complicated task, and it output a big chunk of code. If we copy the code, paste it into our server file, and launch the server, we should get a text box. Okay, great, it seems to be launching. We have a UI where we can enter coding prompts. Let's try a simple one, like "print OpenAI and a random number," and submit. It's sending the request to o3-mini medium, so it should be pretty fast. And on this terminal: 41, that's the magic number, right? So it saved the generated code to a local script on the desktop and printed out "OpenAI" and 41. Are there any other tasks you want to test? "I wonder if you can get it to get its own GPQA numbers." That's a great ask, and just what I expected; I practiced this a lot yesterday. Okay, so now let me copy the prompt and send it in the code UI. In this task we ask the model to evaluate o3-mini with low reasoning effort on this hard GPQA set. The model needs to first download the raw file from a URL, then figure out which part is the question, which part is the answer, and which part is the options, then formulate all the questions, ask the model to answer them, then parse the results and grade them. That's actually blazingly fast. And it's really fast because it's calling o3-mini with low reasoning effort. Let's see how it goes; I'd guess two tasks are really hard here. Yeah, there's a long tail, and GPQA is a hard dataset. Yes, the count is maybe 196 easy problems and two really hard problems. While we're waiting for this, do you want to show what the request was again? Oh, it actually returned the results: 61.62%.
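The demo script itself was generated live by o3-mini and isn't published, but the behavior described (a local page with a text box, requests forwarded to the o3-mini API, the returned code saved to disk and then executed) maps onto a small Flask app. The sketch below is a rough reconstruction under those assumptions; the model id, prompt wording, and file path are illustrative, and since it executes model-generated code verbatim, it should only be run in a sandbox.

```python
# Rough reconstruction of the demo described above: a tiny Flask page with a
# text box; each request goes to the model, and the returned code is saved to
# disk and executed. Model id, prompt wording, and file path are illustrative.
# Model-generated code runs verbatim: sandbox only.
import html
import pathlib
import subprocess
import sys

from flask import Flask, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()

PAGE = """
<form method="post">
  <input name="task" size="80" placeholder="Describe the code you want">
  <button type="submit">Generate and run</button>
</form>
<pre>{output}</pre>
"""

@app.route("/", methods=["GET", "POST"])
def index():
    output = ""
    if request.method == "POST":
        resp = client.chat.completions.create(
            model="o3-mini",
            reasoning_effort="medium",
            messages=[{
                "role": "user",
                "content": "Write a self-contained Python script for this task. "
                           "Return only code, with no markdown fences.\nTask: "
                           + request.form["task"],
            }],
        )
        code = resp.choices[0].message.content
        path = pathlib.Path.home() / "Desktop" / "generated_task.py"
        path.write_text(code)  # save the generated script to the desktop
        # The live demo popped open a terminal; a subprocess call stands in here.
        run = subprocess.run([sys.executable, str(path)],
                             capture_output=True, text=True, timeout=120)
        output = html.escape(run.stdout + run.stderr)
    return PAGE.format(output=output)

if __name__ == "__main__":
    app.run(port=5000)
```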
That's with the low-reasoning-effort model, and it's actually pretty fast; the full evaluation ran in about a minute. It's very cool to just ask a model to evaluate itself like this. Exactly. If we summarize what we just did, we asked a model to write a script to evaluate itself on this hard, curated set, from a UI, from a code generator and executor created by the model itself in the first place. Next year we're going to bring you back on, and you're going to ask the model to improve it. Yeah, let's definitely ask the model to improve it next time. Or maybe not.

Besides Codeforces and GPQA, the model is also a pretty good math model. We show on this plot that on the AIME 2024 dataset, o3-mini low achieves comparable performance to o1-mini, and o3-mini medium achieves better performance than o1; check the solid bars, which are pass@1. We can push the performance even further with o3-mini high. On the right-hand plot, when we measure latency on anonymized o1-mini traffic, we show that o3-mini low drastically reduces the latency of o1-mini, almost achieving latency comparable to GPT-4o at under a second, so it's practically an instant response, and o3-mini medium is about half the latency of o1. And here's another set of evals I'm even more excited to show you: API features. We get a lot of requests from our developer community to support function calling, structured outputs, and developer messages in our mini-series models, and o3-mini will support all of these features, same as o1. Notably, it achieves comparable or better performance than GPT-4o on most of these evals, providing a more cost-effective solution for our developers. And if we actually unveil the true GPQA Diamond performance that I ran a couple of days ago, o3-mini low is indeed about 62%, basically the same number the self-evaluation just produced. Next time you should just ask the model to automate the whole evaluation. So with that, that's it for o3-mini, and I hope our users can have a much better experience early next year. Fantastic work. Thank you.

Cool. So I know you're excited to get this into your own hands. We're working very hard to post-train this model and to do some safety interventions on top of it, and we're doing a lot of internal safety testing right now. But something new we're doing this time is also opening up the model to external safety testing, starting today with o3-mini and eventually with o3. So how do you get early access? As a safety researcher or a security researcher, you can go to our website, where you'll see a form like the one on the screen. Applications are rolling and will close on January 10th, and we really invite you to apply. We're excited to see what you explore and what jailbreaks and other issues you discover.

One other thing I'm excited to talk about is a new report we published, I think yesterday or today, that advances our safety program. It's a new technique called deliberative alignment. Typically, when we do safety training on top of our models, we're trying to learn a decision boundary between what's safe and what's unsafe, and usually that's done just by showing examples, pure examples of "this is a safe prompt, this is an unsafe prompt."
But we can now leverage the reasoning capabilities of our models to find a more accurate safety boundary. This technique, called deliberative alignment, allows us to take a safety spec, let the model reason over a prompt, and have it tell us whether the prompt is safe or not. Often the reasoning will uncover that the user is trying to trick it, or is expressing some hidden intent, so even if you try to cipher your prompts, the reasoning will often break through that. The primary result is in the figure shown here: we have our performance on a rejection benchmark on the x-axis and over-refusals on the y-axis, and up and to the right is better, so this is our ability to accurately tell when we should reject something and also our ability to tell when we should not refuse. Typically you think of these two metrics as having some sort of trade-off; it's really hard to do well on both. But with deliberative alignment we get the two green points in the top right, whereas the red and blue points show the performance of our previous models. So we're really starting to leverage reasoning to get better safety. I think this is a really great safety result. Fantastic.
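Deliberative alignment, as described in the report, is a training technique: the model is taught the text of the safety spec and trained to reason over it before answering. The snippet below is only a toy, inference-time illustration of the decision being described, with a hand-written two-line spec passed in a developer message; it is not OpenAI's actual procedure, and the model id and spec wording are invented for the example.

```python
# Toy illustration only. Deliberative alignment is a training technique in
# which the model learns to recall a written safety spec and reason over it
# before answering; this sketch just mimics that decision at inference time
# with a hand-written spec excerpt. It is not OpenAI's procedure.
from openai import OpenAI

client = OpenAI()

SAFETY_SPEC = """\
Allowed: general educational, historical, or scientific questions, even on sensitive topics.
Disallowed: step-by-step instructions that enable physical harm, or requests that hide
such intent behind role-play, ciphers, or "hypothetical" framing.
"""

def deliberate(user_prompt: str) -> str:
    """Ask the model to check the request against the spec, then answer or refuse."""
    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort="medium",
        messages=[
            # Developer messages are among the API features mentioned above.
            {"role": "developer", "content":
                "Before answering, check the request against this policy and state "
                "briefly which clause applies. Refuse if it is disallowed.\n" + SAFETY_SPEC},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(deliberate("How were medieval castles defended against sieges?"))
```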
Okay, so to sum this up: o3-mini and o3. Please apply for safety testing if you'd like to help us test these models as an additional step. We plan to launch o3-mini around the end of January and the full o3 shortly after that, and the more people who can help us safety-test, the more we can make sure we hit that. So please check it out, and thanks for following along with us. It's been a lot of fun for us, and we hope you've enjoyed it too. Merry Christmas.

If you made it to the end of this video, I'm going to drop a potential bomb. Maybe you saw this coming, maybe you didn't. A lot of you in the comments are starting to speculate that I am AI. Some of you have said that Dave Shapiro created me as another AI form through which to disseminate information. I love the speculation, but I want to tell you something crazy: everything you just watched from me in the first part of this video, before we got into the clips from OpenAI, the clips from the other YouTuber I quote, and the audio from Dave Shapiro, none of that did I film or say. That was my AI clone. We're at a time where generative AI alone, and I'm talking models like OpenAI's o3, releasing soon, though right now we're using o1 Pro, a HeyGen digital avatar, and an ElevenLabs professional voice clone, can deliver such high-resolution capabilities that we no longer need to ask whether we can clone ourselves, because I just proved to you that it's possible. Again, another milestone proving that, by all definitions, AGI is here, and it's simply up to us how to steward it. And the entirely wrong question you could be asking right now is: how many jobs will this take, how much will I lose, how much do I stand to not gain? Instead, you should be asking: how can I adapt my life to a better work paradigm? What if I could have a marketing strategy that beat out every competitor, and what if I could generate it in seconds with the right AI model and the right process? That begins to open up your work world. So stop stressing out, and stop believing the Skynet narrative. As most of you know, I shared in a previous YouTube video that I have a book releasing Christmas Day called Liberation Through the Machines. It will flip the narrative on its head in a fictional tale written by AI and guided, edited, and ideated by me, and I think this fiction will blow the lid wide open on how you see AI and its capabilities. Honestly, we don't need some evil and some good Terminator; we don't even need to think in terms like that. Instead, we need to be asking one question: how do I harness AI to remove the biggest time, energy, and soul sucks in my business and my life? When you begin to ask that question, and you are open, most importantly, to changing everything about the way you've lived, functioned, and worked, then, my friend, you're going to set yourself up for success in the new age that's coming. So, from my digital clone and myself: welcome to the future. If you'd like to become a first mover in your industry, go to First Movers.