o3-mini is really good (but does it beat deepseek?)


Theo - t3․gg

OpenAI just released their new reasoning model o3 mini, with some very clear responses to the crazy ...

Video Transcript:

Today's the day. It's finally here: o3-mini, the first model from OpenAI that's actually reasonably priced and also reasonably good. There's a reason that this all came out, and yes, I'm using "reason" a bunch because it's a reasoning model. I can't help myself. But the thing that made this release as interesting as it is isn't actually OpenAI's work at all. It's another company: DeepSeek, out of China. It really seems like they built this model to compete with what's going on over in China, and it shows with both the pricing and the performance you can get, but also with some of the weird quirks in how they implemented it. This is going to be a real deep dive where I'm going to go from showing the performance of the models, the price of the models, and how these things compare, as well as the strengths and weaknesses when you're using them for real. I spent the whole day playing with o3-mini and it's been a wild ride. I'm super excited to show you guys everything I've learned, but first, a quick word from today's sponsor.

Today's sponsor is Ragie, and you AI bros definitely want to hear this one, because they made it way easier than it's ever been to connect basically any service to your AI apps. So if you have data in something like, I don't know, Google Drive or Slack or Salesforce or Confluence or even Notion, they'll handle everything kind of Plaid-style, where you can sign in with any of those apps in your app, and now you can connect that data and index it in your AI applications. So if you're building a chat and you want to be able to chat with your Notion data or your Google Drive data or someone else's, this is the easiest way to do it. Everything from auth to indexing is handled for you, and the code you have to write is so comically simple: you can just curl their endpoint with your API key and tell them what you want to index. You can hand them a PDF directly if you want; they'll chunk it and index it all for you, and then you can retrieve it with a simple fetch request. This stuff was not easy for
us to set up in T3 Chat. There's a reason we're still adding things like PDF reading. This will make it super easy for you to build great RAG experiences in your existing AI apps, or if you have a new app idea, this is probably the right way to get started with it too. Huge shout out to Ragie for sponsoring today's episode. Check them out today at soydev.link/ragie.

Diving straight in, we should probably start with the pricing. I went and took all the most common things that people tend to use and put them in this quick chart so we can compare. ChatGPT 4o, and this also: the input and output is per 1 million tokens. I'll do that to fix formatting. So whenever you have 1 million input tokens (a token is roughly a word, so to speak), 1 million input tokens on standard ChatGPT 4o API pricing is $2.50; output is 10 bucks. 4o mini: 15 cents in, 60 cents out. That's why we offer 4o mini as a free option in T3 Chat. Speaking of free tiers, by the way, we're actually temporarily offering o3-mini on our free tier, so if you do want to try it out and use the best of the best current models, it's worth a shot. And if you do want to try all the other models too, it's only eight bucks a month for very, very high message rate limits on all of the models that we offer, which right now is pretty much everything you'd ever reasonably want to use.

So yeah, 4o mini: very, very cheap. o1: not so cheap. 15 bucks per million input tokens, 60 bucks per million output tokens. That's why o1 is rough to use and why you don't see it in too many services. There's one particular price I do want to call out here, though: Claude. Claude is insanely expensive, especially when you consider the fact that it's not even a reasoning model. I am floored at the price and that they get away with this. By far our biggest expense for T3 Chat is just paying Anthropic for Claude usage.

DeepSeek V3: 27 cents in and $1.10 out. "V3 old," because they're changing the pricing, is the pricing that I got so excited about. V3 was not their reasoning model. It's their, funny enough, it's
actually really close to Claude in its performance and characteristics, from my experience. But their old version of the pricing was hilariously cheap: 14 cents in, 28 cents out. It's actually cheaper than 4o mini, with quality levels comparable to Sonnet. It was unbelievable. It's going to be a bit more expensive coming up soon, and I understand why: their APIs have been hammered because of how cheap they made things, and also because R1 blew up. Speaking of R1: 55 cents in, $2.19 out puts it at cheaper than 4o or Sonnet while working closer to how o1 works.

But we are missing something in here. We don't have o3. So what is the price for o3-mini? o3-mini's input price is $1.10 and the output price is $4.40. That is incredibly competitive. This effectively, in my opinion, makes 4o not really worth touching. It makes the value pitch for 3.5 Sonnet incredibly weird and arguably quite weak, and it also positions o3-mini really, really well against DeepSeek R1.

All we're measuring here is the price. There's a lot of other things we need to measure. There are incredible services like Artificial Analysis; they do a great job of comparing different models and their pricing, but they don't have o3 on here yet. They're also much more focused on traditional benchmarks, things like MMLU and GPQA. I have my own things I like to use to test these models out, though. One in particular that I've actually enjoyed a lot is a little programming challenge. I enjoy Advent of Code. Advent of Code is a set of Christmas-themed problems that come out at the end of every year, in December. You get one at around 9:00 p.m. Pacific time every night. It's a super competitive, really fun set of programming challenges. I compete actively every year, and last year I actually had a really good year. You see my numbers here: I was in the top 1,000 almost every day. I should thank Cursor for a good bit of this. It helped a lot with writing things I didn't want to, you know, the classic traveling-salesman-type stuff. I don't want to ever write Dijkstra's again. Thank you, Cursor, for making that something I didn't have to do.

Some of these problems were brutal, though. In particular, the ones that hurt me the most, which you can see by the hour-plus they took, were day 17 (specifically part two), day 21, and day 24. So what I have done is I took all of these problems, days 17, 21, and 24, and I prompted Claude, o1 Pro, o3-mini, and R1 to see how they would perform. I thought this would be a fun test, and it ended up being more fun than I expected, not because the answers that they gave code-wise were great, but because I learned so much more about how broken all of these UIs are. See this ChatGPT thread? See that, uh, funny hourglass thing there? That's because this o3-mini-high prompt has just been sitting here for probably like 30 minutes now. This is all just one prompt, by the way. It just
kept reasoning, kept reasoning, and then just silently died. It never even updated the title. It's bad. It's so bad. And I resubscribed to Pro again, so I'm on the $200 tier here, so I could prompt not just o1 but o1 Pro, the super fancy, super expensive model they don't even let you hit via API. Oh God, their UI is so laggy. You see how long it took for that tab to change? I clicked. I'm clicking now. It takes like half a second and then does a terrible scroll to the bottom once you... oh God, that's so bad. How do they build it like this? We don't have any of these problems with T3 Chat. You click and it immediately goes where you are clicking. No weird scroll movement, no weird rendering. So I used T3 Chat as much as I could. You can see that because when you hover now, it says what model you generated with. Nice addition; makes this a lot easier for us to figure out and test.

So what were the results? Let's start with looking at day 17 so you can understand the problem. This is a fun one. Part one was an interesting three-bit computer problem. Let me try and remember exactly what it was. Oh yeah, so you have these three registers and you have the program. You have to determine what the program is trying to output, which is going to be a set of numbers. You're given these operands: 0 through 3 represent literal values, 4 is A, 5 is B, 6 is C, and 7 does not appear in valid programs. This is so that you're building a three-bit computer, basically. It's a fun problem. They love doing things like this.

Let's see how the answers came out, though. Part one: this is my answer. o1 got it, o3 got it, R1 got it. Claude made a real fun hallucination. I'm going to Command-Z till I have it back. It was hilarious. Here it is: it hallucinated window, believe it or not. Your window does not have a Node-compatible file system on it, so I had to fix that. But once I fixed that hilarious hallucination, we were good. Part two: nothing got it. R1 ran indefinitely, and Claude had the same hallucination, but it failed because it set a cap on how many
times it should run, because it assumed for part two that it could eventually just find it under a certain reasonable range. It clearly does not understand Advent of Code very well, because part two here was: what's the lowest positive initial value for register A that causes the program to output a copy of itself? This was a brutal part two that took a very, very long time to solve with a ton of caching, and I don't remember how I solved this one. It was a while ago. But yeah, none of them could solve that. There's a reason it was such a hard problem. Makes sense.

How about day 21? Day 21 was a really interesting problem. It has this keypad, and instead of the keypad being a thing you just press as a human, there's a robot that is pressing the buttons, and you tell the robot which button to press by giving instructions: up, down, left, right to move its finger up, down, left, or right, and then press to press. There's a couple catches, though. If it ever hovers its finger over this empty spot, it fails, so you have to make sure all the instructions you give never put it here. And catch two is you don't control it directly. You control it via another one of these pads. This pad is on the back of the robot, then there's another robot behind it controlling that, then another behind that. So you have to, like, expand the instruction set from whatever the numbers are. Like, if you wanted to hit 852, then that is to start with the robot at A, so that has to go up, left, and then hit 8, but then the robot behind that has to hit up on the robot, and it... hopefully you're getting the idea. This is a really weird problem, and it's a very fun one, but it's also a very hard one, which is why none of them got the answer.

One interesting thing I noticed is that Claude just straight up ignored my prompt. My prompt specifies in it to grab the input from a file named input.