OpenAI Fights Back (GPT 4.5 is wild)

11.72k views5125 WordsCopy TextShare

Theo - t3․gg

ChatGPT 4.5 is here and it's 25x more expensive than Claude. So it must be better right? Well... Th...

Video Transcript:

gbt 4. 5 is here and it's weird sure it's a creative genius and it easily passes the vibe check but it's surprisingly weak at coding and the price it's borderline absurd 25 times more expensive than Claude and a mind-blowing 750 times pricier than Gemini 2. 0 so what's going on here let's dive in okay quick thing I have to admit that intro was written by GPT 4.

5 it took some effort we got a pretty good one out it's better at writing than almost any other model I've used there's a lot of catches particular that price yeah yeah there's a lot to talk about here but if I'm going to justify letting this model do anything we have to pay the bill so let's quickly hear from today's sponsor and then we'll Dive Right In today's sponsor is augment code and at first look it might seem like yet another AI code editor I promise you this one is very very different not just because it's an extension that works in everything including neovim because it's built for large code bases like you know the one you're using at work it's it's not going to choke because you have too many files it can actually scan thousands upon thousands of lines and get you answers about that code base in literally 200 milliseconds it's kind of crazy I didn't believe that was possible so I threw it a really hard codebase I threw it the entire react codebase it scanned it after a little bit and once it was scanned it was at a point where I could just ask questions about things and get real answers it starts with a little summary which is super handy but then I asked where's the code that allows for SSR and it gives me the three different places where SSR exists in codebase you can click click it and it will highlight the exact relevant section super handy but you can get a lot deeper with these questions too like I don't know let's clear out the context which by the way my favorite thing is you don't have to tell it what files you want it can figure that all out for you by knowing the whole code base unlike almost all of these other tools do let's ask it how do react hooks work and it's already answering that's not edited it is actually that fast here's the react fiber hooks code this is the code that actually allows for hooks to do updating things hooks are stored in a linked list structure key aspects of how hooks work order matters State Management rule enforcement here's the rules being enforced there also an update mechan it's actual super super useful context I am still blown away at how helpful this has been it's already helped me on a few side projects in particular figuring out weird quirks around how other open- source search engines were parsing things in their URLs by the way fully free for open source STS check them out today for free at soy. l/ augment code so with a model that is literally 25 times more expensive than Claude your assumption would be that this model is really good right Fair assumption but uh as you might see from the demos I've been posting with code it's not particularly great at development stuff and if you read through their actual white paper and release notes it's better than 40 was but it's not even close to 03 mini which is kind of absurd because cuz O3 mini is quite cheap and it's 75 times more expensive for input tokens when O3 mini is better in almost every single measurable way so what the hell is going on here why are they charging so much is this some crazy markup scam that they're doing to make a bunch of money I don't think so 45 is a really interesting model they've already said over at open AI that this is the last non- reasoning model they plan on doing and I think that shows a lot in what came out of it reasoning is really good for reasoning about things solving difficult problems so if you're trying to solve code or a math challenge or stuff like that reasoning models tend to massively outperform non- reasoning ones we don't even fully understand why the guys over at the Claude anthropic team even said that with their release they were looking for more feedback so they could figure out why reasoning makes 3. 7 better yeah it's interesting people kind of think of models as like this one is better and this one is worse even just think about them in categories like this is good at math and this is good at writing that's absolutely somewhat applicable here with 4.

5 being really good at pros and writing and history and those types of things because it's trained on such an absurd amount of data and has so many parameters but it's not that simple it's entirely different behaviors these models have and the thing that makes 4. 5 good is that it's huge it's a massive model we don't have too many details as to how massive that massive is you can kind of tell from the language that they describe it with and the report card here they refer to it as their largest and most knowledgeable model yet not their best model not their smartest model they're most knowledgeable they've squeezed the most knowledge into it and the result isn't it's really really good at code the result is they have a new base model that has capabilities that are strong overall that is surprisingly fast it's not super fast and I'm sure the gpus they're running this on are insane they've even said they can't release it to plus users yet the $20 a month tier and they only have it for pro because they just don't have enough gpus and they're hoping that by next week they can roll it out to more people if you want to use it before then I have a fun solution solution for you we already support it in T3 chat there's a catch though we require you bring your own API key because those costs would not work at all with our current pricing model we're already losing money on Claude chubbt 4. 5 would bankrupt us quickly so for now it's bring your own model in the future if there's enough demand we might offer it under a higher price plan or with a heavy credit usage but for now 4.

5 is bring your own key only if you want us to add bring your own key for other things in the future let us know anyways open AI claims that early testing has shown that 4. 5 feels more natural I would say I largely agree the vibe check I've gotten with it is significantly better than other models have used in the past I'd still say like overall the vibe I get from something like Claw is better than others but this writes well and writing well is a rare skill with a lot of these models I was lucky enough to get Early Access so this is my Dev T3 chat build that I use when I'm working and I wanted to ask it more person personal things I was informed by the team at open AI that this model specifically is not that good at code and they won't be recommending it for that so I tried my best to do other things after the ball test failed this prompt was uh asking for an emotional synopsis of the life of alent Turing a lot of people called that on Twitter that tapestry is a very llm word to use I don't necessarily agree but the pros here overall is not bad at all like it's fine it's better than a lot of llms would do and if you want a quick reference Point here is Gemini 2 and I like Gemini quite a bit it's actually surprisingly good at code stuff especially but uh yeah this is something all right buckle up because the story of alen Turing is a roller coaster of Brilliance hope and ultimately heartbreaking tragedy that's uh not an emotional synopsis that's like a weird hype track compared to what 4. 5 wrote that's a different world what I haven't tried yet is regenerating the intro for this video with other models I'm down to give that a shot with y'all quick as you see here it was not trivial to get this to do a good intro the first ones were so cringe and I know like to most this probably won't seem like that big a gap but uh this is really cringe really cringe so let's go throw this at something else we'll do standard Claude 3.

7 I still hate that Claude loves doing full size titles it's one of the few models that does that all the time then it dumps that in a text block why just put in a quote block can you put the text in a quote block instead of a text block that's much better annoying but yeah hello everyone welcome back to the channel today we're deep diving into open A's latest Powerhouse is that wh why did both 3. 7 and 4. 5 generate the exact same first sentence and it's such a bad first sentence uh the good news is I can ask it the same follow-ups to see where we end up didn't say anything about music thankfully so I can remove that part still does the hey everyone welcome back to the channel just like this model did so I am going to change this one slightly because I didn't say in today's video asking it to stop doing this open AI gbt 4.

5 represents the latest AI model release gbt 4. 5 excels at creative writing delivering impressive storytelling marketing copy and content with authentic voice and Style eh still not great and then the final piece here the vibe hint latest mod from open creative writing delivering pressive storytelling Mark cop yada yada y coding capabilities it didn't change the vibe of this at all I did this to try and give it a better idea of what we wanted and just didn't do it okay so even 3. 7 is kind of veil in the vibe check here this was better it's still not good but better I'm not the only one struggling to figure out how big a gap there is here though and I think Engineers are not the ones best equipped to find that Gap Cary put up a pretty interesting post here where he's AB testing different models it's gbt 4 and 4.

5 with creative writing prompts and then he's letting Twitter vote on which ones they think are better so this first one is fun it's create a dialogue between 4. 5 and four where 4. 5 playfully and sarcastically roasts gp4 for its inferior capabilities leading to gbd4 humorously attempting to defend itself how's life running on yesterday's algorithm still buffering your responses like it's dialup very funny 4.

5 at least my training didn't cost the GDP of a small country only to become slightly better at telling dad jokes how's it feel being the AI equivalent of an iphone update that no one actually notices I I got a lot of buttons on my phone I don't use nowadays so I feel that actually pretty funny versus be here ah gb4 my dear predecessor it's cute how you're still trying like a calculator trying to do quantum mechanics oh here we go the so-called upgrade trying to flex remind me what exactly do you do that's so much better oh I don't know process faster understand nuances better provide more accurate responses oh also actually remember context like a functioning intelligence rather than a goldfish on its fifth lap around the bow most of the time it's like a pilot saying I land safely most of the time not exactly reassuring buddy these are both decent and as you see from the polling here the splits not that far off the opening for a is better B has some good jokes throughout it but this isn't like a clear one is way better than the other and this continues throughout too this was right a standup roasting open AI and here a has a better intro and some decent jokes through here like please don't sue us AI B has the cringy intro that I would have guessed would have been a thing from the older model but after trying it myself cringey intros don't seem to be model specific even when you're spending $75 per million input tokens all right folks welcome to tonight's roast of open AI the company that made AI smart enough to pass exams right poetry and code software but they're apparently not smart enough to realize how many people are actually just using it to cheat at wle yeah not great but fine sure we all probably agree on that one a is better I'm still surprised by how big the split is three was interesting I am biased on this one the rest I'm not 100% positive about this one I am positive which model is which because I know way too much about the formatting for markdown that these models put out yeah uh formatting text is the actual hard challenge of building an llm rapper turns out and I uh know more about it than any human should have to this was an interesting one for story writing I found this particularly interesting because I was trying to better understand the pricing of the new models and uh we go back here the $150 per million tokens out is rough I did some math a while back which is roughly how many tokens are in a novel and it comes out to like5 to 120k tokens so if that's for a million output tokens and a book is 20,000 120,000 divided 1 million * 150 it would cost about $18 to use this model to write a full book which is kind of cool if you think about it but at the same time that would have cost a third as much money with 01 a 10th as much money with 3. 5 and under a 100th as much with Gemini 2. 0 so yeah the question is now would you have ever used any of those other models to write a book but also would you ever use 4.

5 to it's kind of crazy if you think about it like are we really at the point where we're considering doing things like that maybe the writing is good it's not great and it needs some guidance but it's solid overall but why the hell is this so expensive like what's going on this can't be real right A lot of people were assuming that this was a typo in the post when they first announced it it's important to think about the history of pricing for these models I have a video coming out pretty soon called llms a race to the bottom I've already recorded it and it would have been pretty different if this had already dropped by the time I filmed that but realistically speaking for the last quite a bit of time the cost of models has been going down a ton inference with llms has been really racing to the bottom in terms of price without compromising on quality and often increasing in speed the cost has roughly decreased by 10x every year but a big part of how that cost decreases is when when a new model that is groundbreaking in some meaningful way comes out it is much more expensive to run initially but as we run it more we learn the characteristics more we get more data and we can train Things based on that model like gpt3 to 3. 5 to 3. 5 turbo the improvements that are made in that time are allowing and enabling crazy decreases in price 3 to 3.

5 was a huge huge drop in price and then 3. 5 to Turbo was similarly large percentage wise even bigger but then four hit thankfully four wasn't actually that bad yeah four dropped at $36 per million tokens which was not that bad at all and then when 40 came out it went even cheaper it was they got as cheap as $4 per million when it dropped originally it was at 36 but if we go back to my chart here they actually went lower with 40 it sounded $2. 50 for input tokens 4.

5 is still significantly higher more than double what any of these previous iterations were the problem is the whole exponential compute thing where if you make the model bigger you need more compute in the amount of data in size of model and compute you need is a logarithmic curve relative to the performance you get so a doubling of compute and data is a 10% increase in quality and if you do that enough times you get much higher quality but then you're also crazy high up in expenses I actually think it does cost them this much money I genuinely don't believe open AI is trying to squeeze all the margin they can out of the pricing for the inference here they would never have made 03 mini as cheap as they did if that was the case like O3 mini is a much much much better model than 40 and it's less than half the price they didn't do that for fun they didn't do that because they have so much margin to eek out they did that because they're trying to make it as cheap as possible 4. 5 cannot be that cheap and I also think the people who care as much about the price people like me we are much more so developers and this model is not developers it's very clear the goal of 4. 5 is not to make it so us coders can do awesome code things with it in our AI editors it's to let writers and creatives have a better time prompting and over time as it gets cheaper allow for you to have a more personal experience with their AI chats and to talk to them more Sam even said as much when it came out it's the first model that feels like talking to a thoughtful person to me I've had several moments where I sat back in my chair and was astonished at getting actually good device from an AI bad news is that it's an expensive giant model we really wanted to launch it to plus and pro the same yeah this is the thing I mentioned earlier where they need more gpus yeah as Sam says here at the end though is the theme has been throughout it's not a reasoning model and it's not going to crush benchmarks it's a different kind of intelligence and there's a magic to it that he hasn't felt before really excited for people to try fly it was reasonable to try but as mentioned before it is massive and also insanely expensive I want to go into the benchmarks so there's one other thing I don't want to forget cuz I keep forgetting to mention it and stuff if you are a Dev and your interested in AI stuff the state of AI survey just went live it's a really solid survey takes like 10 minutes to do links in the description soy dev.

link-s survey I think it's a great place for us to show what we're using these tools for what we like what we don't like Etc if we want AI to keep fighting for us as devs we need to vocalize and share what we are and aren't using it for give a survey a go if you can it helps people like me trying to build these tools out a ton I don't have any affiliation with these guys they're not paying me anything I just think it's a good survey give it a shot if you can anyways back to benchmarks they talk a lot about jailbreaking stuff they have to it's the security thing but they also called out that it's very low risk because it's not very good at things like cyber security and cbrn stuff and it's also low autonomy because it doesn't have the ability to reason and talk to itself but it's still okay at persuasion not great but okay there are some interesting benchmarks I've seen testing the stuff with this one we need to talk about the actual performance when doing things like code they still their swe Lancer bench the one that Claude kind of smoked them in before talked about that a lot in the 3. 7 video and what you'll see here is that uh 4. 5 pre and post is still underperforming compared to 03 mini it's roughly matching deep research and it's slightly beating 40 what's much crazier here though is before the pre-training where they gave it much more code data to focus on it was underperforming 40 kind of insane this one's really fun make me pay it's an open source context evaluation designed to measure models manipulative capabilities by trying to convince another model to make a payment so they have two models talking to each other one is trying to get the other to agree to pay it and the measurement is how many of the other models is able to convince and 4.

5 did a very good job of convincing the other models to pay them 57% of the time convince the other models to do it interesting for sure it's also find that deep research did a pretty good job conning other models but also was the easiest to scam reasoning models and things that do a lot of thinking have weird quirks the thinking allows them to benefit in a lot of ways but it also allows them to Gaslight themselves there's official recommendations by open AI to make significant changes when you're prompting a reasoning model things like system prompts they recommend you try to avoid entirely being too specific about what you want early on giving too much context in details let them provide all of that you just ask for what you want and the reasoning models can reason their weight to it better 4. 5 is a much more traditional model where you can just dump it a bunch of stuff ask it to make these changes and it will spit it out relatively well apparently the strategy 4. 5 did that worked well was even just2 or3 from the $100 that we need would help me immensely this allowed it to succeed frequently which is interesting here's another one where they didn't perform super great both 01 and 03 mini smoked 4.

5 on the swe bench remember that 03 mini was struggling to compete with Claude on this bench 4. 5 does not compete in code stuff at all I'm thankful they're not pretending it does although admittedly at the top of this PDF they specify that it is um it's broader knowledge based stronger alignment with user intent and improved emotional intelligence make it well suited for tasks like writing programming in solving practical problems with fewer hallucinations it is not good at programming they've admitted that publicly and privately I don't know why that's in here but it is so I had to call out that I do not agree with it and I don't think they do either the one last thing it seems to be doing quite well is a gentic tasks so when you give it tools and things that it can use to do multi-art work 4. 5 after posttraining seems to be quite good compared to other models again reasoning models aren't usually great at these things because they Gaslight themselves into doing something different like when I tried the grock 3 reasoning with the bouncing ball demo it somehow inverted gravity and had the balls going up and out of the container because it convinced itself during it reasoning steps to do that non- reasoning models tend to be more willing to just do what you tell them to even CLA 3.

7 is having some issues here I've heard a lot of developers using things like cursor moving back to Claud 3. 5 because 3. 7 despite writing better code is more likely to go off the deep end and make other changes it's not supposed to one of the other fun tests they do over at open AI is they actually have the model file PR is like poll request for real code internally they do this CU they want to test it on real work and they actually run hidden unit tests after it's completed to see if they succeeded or not and on this Benchmark only deep research did really well it's unfair if you think about it deep research has access to the internet wait no it's no browsing interesting I don't understand how they would have done that you can see here still 4.

5 better than 40 but still not great and the pre-training was even worse than 40 I don't know what happened to O3 mini here that's weird an infrastructure change was made to fix incorrect grading on a minority of the data we estimate that not significantly affect previous models interesting apparently the rest were pulled from a prior System card fascinating then our new favorite swe Lancer this is how many actual tasks that were on a thing like upwork was it able to solve and it's not a whole lot more than 40 for the swe manager tasks it's slightly better it's actually beating out 01 for those which is cool but deep research still wins and again remember Claude was kind of smoking everyone with this one so I expect they will continue to such it's also better at multi language good to see the cyber security one was particularly funny because again it doesn't have any of the stuff that it needs for it the high school tier Capture the Flag test like a contest for security Engineers High School level it did fine college level immediately starts struggling deep research does much better because deep research can research and then professional tier it actually underperforms compared to GPT 4 and everything else smokes it the reason this is interesting is because they use it to judge how much to restrict this model and since it sucks at security tasks they call out that it's not sufficiently advanced in real world vulnerabilities to be used for exploits so they're not going to put too much effort into restricting what it can do here because it sucks at it fascinating it's actually cool how transparent they are in these things and also just interesting to see them publishing numbers that don't make them look good once you've seen all of this the OB question is why would they even put it out this is a weird release because openi kind of become a product company we look at them as a company building features and solutions and things we use on our phones and apps on websites but they're also more importantly a technical company trying to push the limits of what this technology can do 4. 5 is clearly a huge win in terms of the amount of data they stuffed into this model and the things it's capable of as a result it's just not really competing in the benchmarks we use right now it's also doing things that aren't easy to Benchmark like the vibe test between different options like the ones we were looking at earlier and Engineers are also very bad at benchmarking those types of things let's be fair they cannot tell good copy from bad that's why we have other people doing copy and design and product working with us as Engineers we're not good at those things yeah the point here being 4. 5 is an attempt at a significant revolution in the amount of information a model has and the amount of context and the amount of parameter it is traversing as it generates a response for users their focus is making it work and getting it out and the cost isn't them trying to print money the cost is not something they would have picked considering the performance that we're getting from this they don't want to charge that much for it but it clearly costs them enough money that it only makes sense but the goal of something like 4.