are we cooked w/ o3?

ThePrimeTime
Twitch https://twitch.tv/ThePrimeagen Discord https://discord.gg/ThePrimeagen Become Backend Dev: h...
Video Transcript:
hey I just wanted to take a moment and talk through o3 and the release because I know a lot of people have been just scrambling about what to do with this the metrics look crazy Codeforces it's like top of the world on Codeforces apparently this ARC-AGI benchmark o3 is just crushing through where other benchmarks have failed so how about we take a look at it and I'm just going to kind of go through what happened and then on top of that I'm just going to kind of give you my predictions about
what does this mean for you now and what does this mean later okay cuz I think it's pretty important to kind of understand this so the first thing that obviously happened is if you opened up Twitter in the last couple days you're going to see just a bunch of BS tweets like this AGI has been achieved internally OpenAI says their AGI AGI AGI it's beginning to look a lot like AGI right and all of this happens because of this test right here this is called the ARC Prize test right the ARC-AGI
public and semi-private test in which there are going to be a bunch of little logical puzzles hoping that if you can solve it as an AI this makes you AGI and even though they do claim that hey this isn't a 100% if you solve it it makes you AGI it's a litmus test of some sort to say that you're on the road or close to AGI if you look at the results you can see right away that o3 looks like it did massively better than o1 and even better than o1 mini and it effectively did
by all comparisons of at least the percentages this is looking absolutely fantastic and if you look at the results down here it scored on the private 75% and on the public 82% when it comes to low compute on the high compute side it did even better approximately what 12% better and 20% better on these two with the high compute side so what does one of these tests look like I think it's really important to see one of these tests if you go all the way down here here's a test that had failed effectively
you see this gray square and the cyan square both have these like little pokies coming out of them where the other three squares don't and you'll notice that it moves over by the amount of pokies if you do it again cyan should move over four it does red should move over six it does and then it asks the question okay what happens here and unfortunately the LLM cannot figure that out for us this takes what 5 seconds to realize what's happening here maybe 10 seconds it's pretty easy for us well a lot of compute and
a lot of time and a lot of energy went into it and it was able to answer what was it 82.8% of them so pretty good job pretty impressive and honestly when I first started out programming I remember making my first AI this is when I was getting my master's in AI and it was one of those games where it's one of those weird puzzles where you only have like one open space and then you have to shift tiles in to try to reorient the puzzle to create the picture again because it's been mixed up
and I remember attempting to kind of create an AI to do this and it stumped me and it took me a long time to really figure out any sort of progress on it and so seeing these kind of results is majorly impressive to me because I know how hard it is to make a generalized solver like this that can go into any situation and be able to do pretty good like 75% is very impressive so what does this mean for you well before you answer that question you need to understand the cost what is the
cost the cost is approximately 1.3 minutes on low compute for $20 per task now it's claiming that this thing costs $5 per task for a human to do I'm just not really purchasing that that is the case um I don't know how they're exactly qualifying that statement it kind of seems a bit ridiculous well either way with the cost being $2,000 to solve these 100 semi-private tasks or $6,600 to be able to solve 400 tasks that seems like a lot of money that means even just me and you using this kind
of way for our daily job this o3 low compute is going to be in the order of tens of thousands of dollars per month likely just to be too expensive to use it in any sort of realistic way you definitely are going to have to bounce it off other models and all that and then on top of it the high compute one which has that improved accuracy is 172 times more compute I don't know what that equates to dollar cost assuming it's one for one that means that thing is going to be a 100 what
is that that's $140,000 to be able to solve at an extra 15% more accuracy like that's a lot of money like that's an incredible amount of money for this thing to be able to do so effectively the high compute is just not even tenable for anything and then on top of it you have to ask yourself this question okay it can solve a toy little task that it's never seen before well maybe we can make some sort of base assumption that okay small toy tasks can actually be projected into solving some amount of coding
in a large codebase okay I could see that being a thing like maybe since it can solve that small one it can kind of go into here and kind of gauge what needs to happen and how it needs to happen well the big problem is what does a single bug look like what happens if you have 10,000 files how much larger is one task is it effectively the same thing as calculating out 400 public small tasks because really you have to determine the files the context the chains of thought the huge large
codebase like all that goes together is it going to cost you like $5,000 $6,000 per feature slash bug to compute it's likely way out of reach and so I think for us right now AI is effectively 10,000x too expensive to be really used as a means to just kind of help you speed up at least this AI something like Devin is still very very expensive because you only get 62.5 hours of compute for $500 a month now yeah it can solve some small bugs and when it does solve those small bugs you
still have to look at the bugs you still have to actually fix the bug fixes themselves because often it is still wrong because that's just LLM coding baby and so you have to fix those so not only are you paying a lot of money for not that much compute time you actually can't even really speed yourself up because you still need to handhold it you need to understand the change you need to actually go through so like who would have been faster well that's a whole John Henry question I guess we have to ask you
could theoretically have four bugs being fixed and then you just turn into a permanent code reviewer trying to shove these things through as fast as possible and it may cost you what $2,500 a month to be able to have that much Devin compute time you know one could argue that might be a bit expensive and so entire markets are just completely priced out of today's AI let alone tomorrow's AI which is going to cost way way way way way more and so you know that's how I'm kind of looking at it right now which
is it's cool it can do a lot of stuff and even if you can practically have all the systems hooked up like something like Devin it is way out of control as far as pricing goes and not only that there's also an entire hardware problem if all of us decide tomorrow that we all want to use o3 and we all have the money to use o3 there's no hardware available for us like an entire group of people can't do that the hardware still has a whole limitation for how many people out there to be able
to actually do the computing and not only that but it's a lot of compute power you may not realize this but it's estimated that a single task to be done with o3 high compute is the equivalent of five gallons of gasoline like yo dog I looked at that puzzle and I used what a thimble of water as far as energy goes I use practically nothing to be able to solve it whereas this thing is going to use an enormous amount and potentially fail 20% of the time you know I'm just saying I don't know
I'm not really seeing it and so I just wanted to say all this because I really don't think that we're entering into a world where hard skills aren't going to be needed if anything I think hard skills are going to be needed more and more and more and more because the speed in which things can change is only going to go up but the technical depth isn't going anywhere right oh wow you got a little bug fix coming along the way what does that actually do how does that actually solve your problem is it actually
the solve you want how does it integrate with the system is it going to be the way you want or do you want to completely solve it a completely different way there's going to be so many little questions and answers and all that that are going to have to be done so when you hear these things AGI achieved all this kind of stuff just remember what it means to you is it's probably not going to mean a whole bunch it's probably just simply going to mean life as usual if you use AI today to be able
to solve your problems hey great probably not going to change a lot for you over the course of the next year are you actually going to be able to speed up your work maybe if something like Devin got 10x to 100x cheaper and you were able to actually get a whole bunch of Devin solving and only have to pay maybe like 50 bucks a month could it actually be useful I could actually see that being used by a lot of people um so I'm just not like fully on board yet that we're in some
sort of crazy new cycle I think most of this honestly I think it's just investor grabbing right now OpenAI they need to do some funding they're doing a whole bunch of stuff right now this just sounds like a good old-fashioned funding talk right like why would you even publish that and the reason why I kind of gravitate towards that is I want you to look at this what are you missing from here why does it say the word tuned well what it's trying
to say to you is that the public data set that was available went into o3's training data now it may not be finely tuned to work against the ARC-AGI stuff but it was trained to some level against the ARC-AGI stuff which means that even when trained against it it's still only doing a certain level of accuracy so there's something to be said here why didn't they show the o3 not tuned model the one that didn't have any of the data in there my bet is that it just didn't do nearly as good
they talk about their you know that they did it also none of the o3 mini stats have been released is it actually going to be great are you actually going to be able to get anywhere o1 is still pretty expensive and so what is a massive cost reduction is it half well that's not going to be like that's not that great you can't have hundreds of tasks running yet at o1 it's still pretty expensive so it'll be interesting to see where this all goes I don't actually know where it's all going to go
and I think at the end of the day I don't think any of this stuff matters in the sense that it doesn't change what you need to do you need to continue to grow that gap because right now what I see in the world is this ever growing gap of knowledge people who started before a lot of the revolutions of computing and making things easier and people who are just getting into the field today where everything is abstracted and easy and this gap right here is constantly being filled with like say AI or really useful
services that take care of everything so just everything serverless you know SPA application and you just don't really know how anything works and so you have people that know how things work that can move and are adept and people that don't know how things work and they can effectively crutch themselves through with AI and I think that if you can figure out how to cross this gulf you are only going to become more valuable because as AI becomes more and more integrated and as code just changes faster because let's
just face it that's what they're going to sell us from Silicon Valley because that's what's going to happen code change and code churn is going to go through the roof I don't think that's a good thing but I think that's going to happen people that understand the real technical depth I think are going to become magnified in their value whereas people that don't really understand what's happening they're not going to grow much at all as far as value goes because they simply haven't figured out how to cross this giant gulf and they don't
really have the skills to be able to wield the AIs in a super and highly effective way that's also assuming a lot of things right like I'm also assuming that 10,000x reduction right that's a big assumption like a 10x improvement in efficiency that would be incredible 100x I mean that's insane a 1,000x like that's outrageous right but that's still needed just to even get remotely close to making it useful and then you need like up to 10,000 to 100,000x it's pretty wild anyways so a lot of things to consider and I
think a lot of people have just sold you a bad bill of goods right they want everybody to freak out and have to attach their product because they know their product makes money and at the end of the day the more they can shove this down into every single place the more you have to use it the more it's justified to be used and so that's why you see the classic scare tactics right if you aren't using this you're being left behind says who why are you
being left behind why is understanding something at a deeper and better level why is that making you left behind because the reality is you turn on AI you instantly get those suggestions and all of your experience is immediately applicable so last time I checked it's not like you need to be able to be training on AI for years to be able to use AI AI for anybody is instantaneously good that's why all these people who can't program can start using it why because it's easier
to use so you with technical depth and breadth oh wow yeah it is actually in fact easier to use so stop listening to that stuff stop getting overwhelmed by it honestly pick a language pick a project start doing it you can just do things you can just build it you can enjoy what you're doing don't have analysis paralysis and try to choose the best thing at all times it does not matter that much build it with Laravel build it with Next.js build it with just Go and raw dog an HTTP server do it with just
TCP it does not really matter cuz at the end of the day you choose your learning you choose how you want to go about it and if you can enjoy it and keep on doing it I guarantee you in 5 years those skills they're probably going to be very very applicable especially as the sea of stupidity and the sea of constant change by AIs really gets adopted and we get into the mess that that's going to create over the next couple years anyways there you go that's all I wanted to say I know it
was a bit rambly the name the name the name is ThePrimeagen I got to stop it now
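
The cost talk in the video is mostly back-of-the-envelope multiplication, so here is a minimal Python sketch of that arithmetic. It uses only the figures quoted in the transcript ($20 per task on low compute, 100 semi-private tasks, $6,600 for 400 public tasks, 172 times more compute on high). It leans on the speaker's own assumption that dollars scale one for one with compute, which is not confirmed anywhere. The function names are made up for illustration. Under that assumption the high compute total comes out higher than the rough number said out loud.

```python
# Figures quoted in the transcript; none are independently verified here.
LOW_COST_PER_TASK_USD = 20.0     # "$20 per task" on low compute
SEMI_PRIVATE_TASKS = 100         # "100 semi-private tasks"
PUBLIC_TASKS = 400               # "400 tasks" in the public set
PUBLIC_TOTAL_USD = 6_600.0       # "$6,600" quoted for the public set
HIGH_COMPUTE_MULTIPLIER = 172    # "172 times more compute"


def total_cost(cost_per_task: float, tasks: int, multiplier: float = 1.0) -> float:
    """Total dollars, ASSUMING cost scales linearly with tasks and compute."""
    return cost_per_task * tasks * multiplier


def cost_after_reduction(cost: float, factor: float) -> float:
    """Cost after a hypothetical efficiency gain of `factor` (e.g. 10_000)."""
    return cost / factor


if __name__ == "__main__":
    low_total = total_cost(LOW_COST_PER_TASK_USD, SEMI_PRIVATE_TASKS)
    high_total = total_cost(
        LOW_COST_PER_TASK_USD, SEMI_PRIVATE_TASKS, HIGH_COMPUTE_MULTIPLIER
    )
    high_per_task = total_cost(LOW_COST_PER_TASK_USD, 1, HIGH_COMPUTE_MULTIPLIER)

    print(f"low compute, 100 tasks:    ${low_total:,.0f}")
    print(f"public set, per task:      ${PUBLIC_TOTAL_USD / PUBLIC_TASKS:,.2f}")
    print(f"high compute, 100 tasks:   ${high_total:,.0f}")
    print(f"high compute, per task:    ${high_per_task:,.0f}")
    # The 10,000x figure is the reduction the video says would be needed.
    print(f"per task after 10,000x:    ${cost_after_reduction(high_per_task, 10_000):.3f}")
```

Swapping in a different multiplier or reduction factor is a quick way to see how sensitive the conclusion is to the one-for-one cost assumption.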