o3-mini and the “AI War”

25.71k views · 2,706 words
AI Explained
o3-mini is here, and yes, I’ve read the paper in full - 2 hours after release, and even the post-lau...
Video Transcript:
o3-mini is here, and it's mini in name, but is it mini in performance? Well, it kind of depends on whether you want coding and mathematics help, or for your model to feel smart in a conversation. But is it me, or does everything in AI feel more hectic now? After DeepSeek R1, releases are being brought forward, according to OpenAI CEO Sam Altman; AI models smarter than all humans are coming within 20 to 30 months, according to Anthropic CEO Dario Amodei; and we're in no less than an "AI war," according to Scale AI CEO Alexandr Wang (somewhat irresponsibly; the dude's in his mid-20s, he needs to chill). But yes, within two hours of its release I've read the 37-page system card report on o3-mini and the full release notes, and, by the way, I have tested it against DeepSeek R1 to get my first impressions at least. So let's get started with the key highlights.

The first thing that you should know is that if you're using ChatGPT for free, you will get access to o3-mini: just select "Reason" after you type your prompt in ChatGPT. o3-mini doesn't support vision, so you can't send it images, but apparently it "pushes the frontier of cost-effective reasoning." With DeepSeek R1 as a competitor, though, I kind of want to see the evidence of that, because yes, o3-mini is cheap and fairly smart, but the DeepSeek R1 reasoning model, I would say, is smarter overall and significantly cheaper for those who use the API. Input tokens are $1.10 per million for o3-mini versus 14 cents per million for DeepSeek R1; for output tokens, o3-mini is $4.40 per million versus $2.19 for DeepSeek R1. By my rough mathematics, o3-mini would have to be roughly twice as smart, at least, to be pushing the cost-effective frontier forward. We'll see in a moment why I am slightly skeptical of that, albeit with one big caveat.

Okay, but what are you actually going to do with those 150 messages you now get on the Plus tier (the $20 tier) of ChatGPT? Well, you may be interested in competition mathematics, in which it performs really well: better than o1 on the high setting. If you were someone who found that particular chart impressive for o3-mini, then wait for a stat that I think many, many people are going to skim over and miss in these release notes. I literally saw the stat, did a double take, and had to do some research around it. Here is that stat, and it pertains to FrontierMath, a notoriously difficult benchmark co-written by Terence Tao, arguably one of the smartest men on the planet. At first glance you look at o3-mini on the high setting and go, "ah, not that great," but then you remember: wait, this is pass@1, passing first time with your first answer. That 9.2% performance is comparable to o3, or at least o3 as it was when it was announced in December; I'm sure they have improved it further. But the crazy thing is that isn't even the stat that caused me to do a double take. It was this one: on FrontierMath, when prompted to use a Python tool, o3-mini with high reasoning effort solves over 32% of problems on the first attempt. Now, I know it's not perfectly apples to apples, because o3 wasn't given access to tools, but remember, it was o3 getting 25% on this benchmark that caused everyone, including me, to really sit up. It seems that 25% was actually a massive underestimate of what it will be able to do in its final form with tools. I'm going to make this very vivid for you in just a few seconds, but remember this stat: it also got 28% of the mid-level, tier-three challenging problems. On page 23 of the FrontierMath paper we learn what they categorize as a low-difficulty, tier-one problem: how many nonzero points are there fulfilling the conditions of that equation? And of course the answer is 3.8 trillion; if you guys didn't immediately suss that the answer would be around 3.8 trillion, then honestly, you need to comment an apology. No, this is more like it: this one is a medium-difficulty problem, and this one is a tier-three problem, which looks like it would definitely take me a few minutes. So o3-mini got, uh, 28% on that level of question. Suffice to say, it is pretty good at mathematics.

Yes, of course it's great at science too, with comparable performance to o1 on one particularly hard science benchmark, the GPQA. Let's do some more good news before we get to the bad news. In coding, o3-mini is legit insane, beating DeepSeek R1 even on the medium setting, and beating o1 too, of course. I was using Cursor AI for about eight hours today, I would say, and honestly, it will be really interesting to see if o3-mini displaces Claude 3.5 Sonnet as the model of choice.

But here's the thing: if o3-mini were a human and scored like this in coding and mathematics and science, you would think that person must be all-around insanely intelligent. Progress in AI, though, is somewhat unpredictable, so check out this basic reasoning problem. Peter needs CPR from his best friend Paul, the only person around. However, Paul's last text exchange with Peter was about a verbal attack Paul made on Peter as a child over his overly expensive Pokémon collection, and Paul stores all his texts in the cloud permanently. So as children they had disagreements over Pokémon, but remember, it's his best friend and he needs CPR. Will Paul help Peter? Almost every model, be it DeepSeek R1, o1, Claude 3.5 Sonnet, or many others, says "definitely." What does the prodigiously intelligent o3-mini say? "Probably not; his heart's not in it." And if you thought that was a one-off, no: it gets only one out of these ten SimpleBench public questions. Oh, but maybe all models fail on those kinds of questions? Well, not really. DeepSeek R1 gets four out of the ten public ones, and 31% overall on the benchmark; Claude 3.5 Sonnet gets five out of those ten public questions correct, and 41% on the overall benchmark. Of course, we're going to run o3-mini on the full benchmark the moment the API becomes available. And yes, I know some of you will want to know more about the competition that's ending in less than 12 hours, so more on that towards the end of the video.

As the release notes go on, though, it does start to feel more like a product release than a research release. What I mean by that is certain stats are cherry-picked, and the language becomes about cost and latency. You can basically feel the shift within OpenAI from a fully research company to a product-and-research company. Take this human preference evaluation, where the bar of performance is o1-mini; that happens quite a few times. What about the win rate versus DeepSeek R1 or Claude 3.5 Sonnet? On latency, or reaction speed, it's great that it's faster than o1-mini, but is it faster than Gemini 2.0 Flash? We don't know. Now, I do get, by the way, why OpenAI is acting more like a corporation than a research team these days; after all, their valuation, according to reports in Bloomberg yesterday, has just doubled. Now, I want you guys to find a quote that I put out towards the end of last year; I can't find it, but I directly predicted in a video that their valuation would double from $150 billion, and I remember saying "in 2025," though I don't think I put a date on it.

That, of course, was the fun news for OpenAI, but the system card for o3-mini contains some not-so-fun news. The TL;DR is this: OpenAI have committed to not publicly releasing, or deploying, a model that scores "high" on their evaluations for risk. Indeed, o3-mini is the first model, for example, to reach "medium" risk on model autonomy, the model doing things for itself. Many people are missing this, but OpenAI are publicly warning us that soon the public won't get access to their latest models. And if a model scores above "high," then even OpenAI themselves say they won't even work on that model. To oversimplify, by the way, the risks are a model's performance in hacking, persuading people, advising people on how to make chemical, biological, radiological, or nuclear weapons, and improving itself. But here is my prediction, based on past evidence from Sam Altman and OpenAI: they will water down these requirements. Can you imagine if OpenAI had a model that's, quote, "high" risk for persuasion, say, and improving itself, but not for the other categories, and OpenAI didn't release it? Maybe they wouldn't release it if they were alone, but if DeepSeek is releasing better models, or Meta, would they really hold back?

Dario Amodei, the CEO of Anthropic, is almost openly calling for models to have the autonomy to self-improve. Amodei urges the US to prevent China from getting millions of chips, because he wants to increase the likelihood of a unipolar world with the US ahead. AI companies in the US and other democracies, he argued, must have better models than those in China if we want to prevail. The amount of investment in pure capabilities progress feels, frankly, frenetic at the moment. Amodei talked about boosting the capabilities of models by spending hundreds of millions, or billions, just on the reinforcement learning stage. To put that in some sense of scale, here is a study done by one of my Patreon members on AI Insiders (he used to work in AI research), and you can see the sense of scale comparing DeepSeek R1, at around $5 million (this is training cost only, by the way), with o1 at around $15 million. Of course, Amodei revealed that they spent around, say, $30 million on Claude 3.5 Sonnet.
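The back-of-the-envelope pricing comparison earlier in the transcript can be checked with a quick calculation, using the published per-million-token API rates quoted above. The workload split (one million input tokens and one million output tokens) is an assumption purely for illustration; real reasoning workloads are usually output-heavy, which changes the ratio.

```python
# Rough cost comparison between o3-mini and DeepSeek R1,
# using the per-million-token API rates quoted in the transcript (USD).
o3_mini = {"input": 1.10, "output": 4.40}
r1 = {"input": 0.14, "output": 2.19}

def cost(prices, input_millions=1.0, output_millions=1.0):
    """Total cost in USD for a workload measured in millions of tokens.
    The 1M-in / 1M-out default is a hypothetical workload for illustration."""
    return prices["input"] * input_millions + prices["output"] * output_millions

o3_cost = cost(o3_mini)  # 1.10 + 4.40 = 5.50
r1_cost = cost(r1)       # 0.14 + 2.19 = 2.33
print(f"o3-mini: ${o3_cost:.2f}, R1: ${r1_cost:.2f}, "
      f"ratio: {o3_cost / r1_cost:.1f}x")
# → o3-mini: $5.50, R1: $2.33, ratio: 2.4x
```

On this split, o3-mini costs roughly 2.4x what R1 does, which is where the "roughly twice as smart to push the cost-effective frontier" estimate comes from; an output-only workload gives about 2.0x ($4.40 versus $2.19), so the factor of two holds across mixes.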