BREAKING: OpenAI's new o3 model changes everything
Theo - t3.gg
OpenAI's new o3 model is groundbreaking. So much so that I'll admit: I was wrong. Very, very wrong.
...
Video Transcript:
A few months ago I made a video about how AI development was stalling out. We went from 5x improvements to 5% improvements; it felt like the only thing growing exponentially was the cost. Shout out to my $200 a month ChatGPT Pro subscription. Today, though, OpenAI announced o3, and now that I've seen the numbers, it's clear that I was wrong. o3 is a monumental leap in AI capabilities, from code to math to science. The biggest leap isn't any of those, though: it's the ARC-AGI test. Very few AI models have scored even 5% on this test, and I'll show why in a second. But first we need to see these numbers, because the new o3 model is in the 76% range for its low-compute run and 88% for its high-compute version. It's insane. o1 preview and o1 high were under 35%, o1 mini is only 8%, and other models like Claude aren't even close. We have a lot to talk about here, from the expenses, to how this actually benefits real-world work, to what this ARC-AGI test even is, and we'll get there in a second. But first, a word from today's sponsor.

Today's sponsor is Browserbase, and I like them so much that I invested. Yes, I'm that confident in this company. Their primary product is a web browser for your AI; it's the easiest way to get an AI to browse the web for you. Tell it what to do, where to get information from, and what pages to browse, and it will figure it out, everything from captchas to parsing pages. It's really cool, and you can even test it out right on the site with any of these example prompts. I can tell it to go to Hacker News, and it's going to figure out what I mean by that: it launches the browser, writes all of this code on the fly, and figures out that it needs to go to news.ycombinator.com. It's actually fully using Playwright under the hood; their service is a Playwright host, so if you want to tell a browser to do something, their service is probably what you want. It's so handy that they made an open source package called Stagehand, an AI-powered successor to Playwright, and it's really cool: you can literally write automations by just telling it what you want to do. stagehand.act("click on the contributors"), then extract the top contributors, and you can pass it a Zod schema to validate the output from the page. This isn't exclusive to Browserbase, either; you can use it anywhere you can host Playwright, so anything that can run a headless Chrome can use it, but obviously using their APIs makes it really, really simple to host. Browserbase is dope and I think you'll like them a lot. Check them out today at soy.
link/browserbase.

Historically, AI developers like OpenAI and Anthropic have avoided this test because of how bad it makes them look. Not only is OpenAI no longer avoiding the test, they actually brought the president of the foundation in to talk about it. I could explain how it works, but I'd much rather he do it:

"Hello everybody, my name is Greg Kamradt and I'm the president of the ARC Prize Foundation. ARC Prize is a nonprofit with the mission of being a north star towards AGI through enduring benchmarks. Our first benchmark, ARC-AGI, was developed in 2019 by François Chollet in his paper 'On the Measure of Intelligence', and it went unbeaten for five years. ARC-AGI is all about input examples and output examples: the goal is to understand the rule of the transformation and apply it to produce the output. So Sam, what do you think is happening here?" "Probably putting a dark blue square in the empty space." "Yes, that is exactly it. It's actually surprisingly hard for AI to understand what's going on, so I want to show one more hard example. Mark, I'm going to put you on the spot: what do you think is going on in this task?" "Okay, so you take each of these yellow squares, you count the number of colored squares there, and you create a border with that." "That is exactly it, and much quicker than most people, so congratulations. What's interesting, though, is that AI has not been able to solve this problem so far, even though we verified that a panel of humans could do it. The unique part about ARC-AGI is that every task requires distinct skills: there won't be another task where you need to fill in the corners of blue squares. We do that on purpose, because we want to test the model's ability to learn new skills on the fly. ARC-AGI version one took five years to go from 0% to 5% with leading frontier models. Today, however, I'm very excited to say that o3 has scored a new state-of-the-art score that we have verified. On low compute, o3 scored 75.7%. This is extremely impressive, because it's within the compute requirements for our public leaderboard, and it's the new number-one entry. As a capabilities demonstration, when we asked o3 to think longer and ramped it up to high compute, o3 was able to score 87.5% on the same hidden holdout set. Human performance is comparable, at an 85% threshold, so being above this is a major milestone; we have never tested a system or a model that has done this before. This is new territory in the AI world. When I look at these scores, I realize I need to update my worldview a little bit. I need to fix my intuitions about what AI can actually do and what it's capable of, especially in this o3 world."

The one thing they didn't touch on there that I really think is worth considering is cost. They actually show, left to right, how expensive these models are to run per task, and you'll see that o1 mini was around 20 to 30 cents per task and o1 preview was a little over a dollar a task. The new o1 models are more expensive, as I know well, since I pay the $200 a month for ChatGPT Pro. These new models get significantly better performance, but they're also way more expensive. They actually dropped the hard numbers down here. Admittedly, they blanked out the cost per task for the low-efficiency, high-performance version, but you can see the retail cost for the private test: 100 tasks cost them $2,000 to run to get that 75% score, so each task was $20 and took over a minute to run. And when you consider that the less efficient version took 13.8 minutes per task, almost 14 minutes, the cost is closer to $200 a task, which means that test was 20 grand just in the compute and energy costs to run. It's crazy the level at which hardware has become the bottleneck here. As they call out, the high-compute version used 172 times more compute than the other version, and all of these are still so much more expensive than previous models.

I saw a tweet that I think is important to consider when we think about all of this: it's important to understand how expensive this model is to run. The assumption made both by the anti-doomers and by the doomers themselves was that we would have a vast hardware overhang. The hardware-overhang concept was that once AI was developed and we had real AGI, we'd have all this extra hardware lying around that could push the AI further and further, so it would get stronger and stronger really quickly. We don't have that spare hardware, though. We thought the rapid improvement would happen automatically because there was so much more compute hardware in the world than there was AI being developed, but now that three to five companies have 80% of the world's compute, that's not the case, and they're using all of it. We can finally dismiss this as a fantasy: there isn't a vast hardware overhang. We're in fact incredibly hardware-limited these days, and we can barely afford the hardware for cutting-edge AI models. Will models become much cheaper to run? Sure. Will we see some sort of fantasy hard takeoff in a situation where we're this hardware-limited? Absolutely not. Very important callout: without significantly more hardware, and significantly better hardware, continuing to make monumental leaps is nearly impossible. The monumental leap here in performance also comes with a similarly monumental leap in cost and in the amount of hardware running per task.

That said, the results we're seeing are insane. We're talking about Codeforces scores that put o3 among the top 200 Codeforces competitors in the world;
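Those cost figures are easy to sanity-check. A quick back-of-the-envelope in Python, using only the numbers quoted above (the ~$200-per-task figure for the high-compute run is an estimate inferred from the runtime, since OpenAI blanked that cell out):

```python
# Numbers as quoted for the ARC-AGI private eval: 100 tasks.
num_tasks = 100

# Low-compute o3 run: ~75.7%, about $20 per task, just over a minute each.
low_cost_per_task = 20                      # dollars
low_total = num_tasks * low_cost_per_task
print(low_total)                            # 2000 -> the "$2,000 to run" figure

# High-compute o3 run: ~13.8 minutes per task; the per-task cost was blanked
# out, but ~$200/task is the estimate discussed above.
high_cost_per_task = 200                    # dollars (estimate, not official)
high_total = num_tasks * high_cost_per_task
print(high_total)                           # 20000 -> "20 grand just in compute"

# The high-compute configuration reportedly used 172x the compute of the low one.
compute_ratio = 172
```

Note that the 172x compute ratio is far larger than the ~10x cost ratio these estimates imply, which is one reason the blanked-out cell is probably conservative.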
we're talking about a model that can solve real scientific problems as well as PhD graduates can; we're talking about a model that can generate a full project from scratch that performs complex tasks with reasoning inside them. The prompt here was: "Write me a Python script which launches a server locally for an HTML file which has a big text box. When I enter things into the box and press submit, it should send a request to the o3-mini API with medium reasoning effort, take the resulting code, save it to a temporary file on the desktop, then execute it in a new Python terminal." It took 38 seconds to run. Let's see how it works. They copy the code, paste it into their server, and launch it: "We have a UI where we can enter some coding prompt, so let's try out a simple one. It's sending the request to o3-mini medium, so it should be pretty fast." In this task they asked the model to evaluate o3-mini with low reasoning effort on the hard GPQA dataset. It's still using o3-mini behind the scenes, but they prompted o3-mini to build them their own custom tool that then prompts o3-mini via the API. Here they ask it to generate code that evaluates o3-mini with low reasoning effort on the GPQA dataset: "You need to download the raw file here, answer..." yada yada. "That's actually blazingly fast." "Yeah, oh, it actually returns the results. It's, uh, 61..."
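The generated script itself wasn't shown in full, but the shape the prompt describes is easy to reconstruct. Here is a hedged sketch using the standard library plus the OpenAI SDK; this is my own reconstruction, not the demo's actual code, and the model name "o3-mini", the reasoning_effort parameter, and every helper name below are assumptions based on what the video describes:

```python
# Sketch of the demoed script: serve a page with a big text box; on submit,
# ask o3-mini (medium reasoning effort) for code, save it to a temp file,
# and run it in a fresh Python process. All names here are assumptions.
import http.server
import subprocess
import tempfile
import urllib.parse

FORM_HTML = """<!doctype html>
<html><body>
  <form method="post">
    <textarea name="prompt" rows="10" cols="60"></textarea><br>
    <input type="submit" value="Submit">
  </form>
</body></html>"""

def strip_code_fence(text: str) -> str:
    """If the model wraps its answer in ``` fences, keep only the code inside."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)

def generate_code(prompt: str) -> str:
    # Assumed OpenAI SDK usage (needs OPENAI_API_KEY); not exercised offline.
    from openai import OpenAI
    resp = OpenAI().chat.completions.create(
        model="o3-mini",
        reasoning_effort="medium",  # as described in the video
        messages=[{"role": "user", "content": prompt}],
    )
    return strip_code_fence(resp.choices[0].message.content)

def save_and_run(code: str) -> str:
    """Save generated code to a temp file and execute it in a new interpreter.
    (The demo saved to the desktop; a temp dir keeps this sketch portable.)"""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    subprocess.Popen(["python", path])  # "a new python terminal" in the demo
    return path

class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(FORM_HTML.encode())

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        save_and_run(generate_code(form["prompt"][0]))
        self.do_GET()  # re-serve the form after handling the submission

def serve(port: int = 8000) -> None:
    http.server.HTTPServer(("localhost", port), Handler).serve_forever()
```

The self-referential part of the demo is then just a matter of what you type into the text box: asking this tool for "code that evaluates o3-mini with low reasoning effort on GPQA" makes o3-mini write and run an evaluation of itself.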