OpenAI’s new “deep-thinking” o1 model crushes coding benchmarks

Fireship
Let's take a first look at the new ChatGPT o1 model - a state-of-the-art reasoning AI model from Ope...
Video Transcript:
I thought it plateaued. I thought the bubble was about to burst and the hype train was derailing. I even thought my software engineering job might be safe from Devin. But I couldn't have been more wrong: yesterday OpenAI released a terrifying new state-of-the-art model named o1, and it's not just another basic GPT. It's a new paradigm of deep-thinking or reasoning models that obliterate all past benchmarks on math, coding, and PhD-level science. And Sam Altman had a message for all the AI haters out there: two steps ahead, I am always two steps ahead.

Before we get too hopeful that o1 will unburden us from our programming jobs, though, there are many reasons to doubt this new model. It's definitely not ASI, it's not AGI, and it's not even good enough to be called GPT-5. Following its mission of openness, OpenAI is keeping all the interesting details closed off, but in today's video we'll try to figure out how o1 actually works and what it means for the future of humanity. It is Friday the 13th, and you're watching The Code Report.

GPT-5, Q*, Strawberry: these are all names that leaked out
of OpenAI in recent months, but yesterday the world was shocked when they released o1 ahead of schedule. GPT stands for generative pre-trained transformer, and the O stands for "oh, we're all going to die."

But first, let's admire these dubious benchmarks. Compared to GPT-4, it achieves massive gains in accuracy, most notably in PhD-level physics and on the massive multitask language understanding benchmarks for math and formal logic. But the craziest improvements come in its coding ability. At the International Olympiad in Informatics, it placed in the 49th percentile when allowed 50 submissions per problem, but beat the gold-medal threshold when allowed 10,000 submissions. And compared to GPT-4, its Codeforces Elo went from the 11th percentile all the way up to the 93rd percentile. Impressive. But they've also secretly been working with Cognition Labs, the company that wants to replace programmers with that greasy pirate gigolo named Devin. When using the GPT-4 brain, it only solved 25% of problems, but with o1, the chart went up to 75%. That's crazy, and our only hope is that these internal closed-source benchmarks, from a VC-funded company desperate to raise more money, are actually just BS. Only time will tell, but o1 is no doubt a huge leap forward in the AI race, and the timing is perfect, because many people have been switching from ChatGPT to Claude, and OpenAI is in talks to raise more money at a $150 billion valuation.

But how does a deep-thinking model actually work? Well, technically they released three new models: o1-mini, o1-preview, and o1 regular. Us plebs only have access to mini and preview, and o1 regular is still locked in a cage, although they
have hinted at a $2,000 Premium Plus plan to access it. What makes these models special, though, is that they rely on reinforcement learning to perform complex reasoning. That means when presented with a problem, they produce a chain of thought before presenting the answer to the user. In other words, they think. Descartes said, "I think, therefore I am," but o1 is still not a sentient life form. Just like a human, though, it will go through a series of thoughts before reaching a final conclusion, and in the process produce what are called reasoning tokens. These are like intermediate outputs that help the model refine its steps and backtrack when necessary, which allows it to produce complex solutions with fewer hallucinations. The trade-off is that a response requires more time, computing power, and money.

OpenAI released a bunch of examples, like this guy making a playable snake game in a single shot, or this guy creating a nonogram puzzle. And the model can even reliably tell you how many R's are in the word "strawberry," a question that has baffled LLMs in the past. Actually, just kidding: it failed that test when I tried to run it myself. And the actual chain of thought is hidden from the end user, even though you do have to pay for those tokens at a price of $60 per 1 million. However, they do provide some examples of chain of thought, like in this coding example that transposes a matrix in bash. You'll notice that it first looks at the shape of the inputs and outputs, then considers the constraints of the programming language, and goes through a bunch of other steps before regurgitating a response. But this is actually not a novel concept. Google has been dominating math and
coding competitions with AlphaProof and AlphaCode for the last few years, using reinforcement learning and synthetic data, but this is the first time a model like this has become generally available to the public. Let's go ahead and find out if it slaps.

I remember, years ago when I first learned to code, I recreated the classic MS-DOS game Drug Wars, a turn-based strategy game where you play the role of a traveling salesman and have random encounters with Officer Hardass. As a biological human, it took me like 100 hours to build. But let's first see how GPT-4o does when I ask it to build this game in C with a GUI. It produces code that almost works, but I wasn't able to get it to compile, and after a couple of follow-up prompts I finally got something working, but the game logic was very limited. Now let's give the new o1 that exact same prompt. What you'll notice is that it goes through the chain of thought, like "thinking," then "assessing compliance," and so on. What it's actually doing under the hood is creating those reasoning tokens, which should lead to a more comprehensive and accurate result. In contrast to GPT-4o, o1 compiled right away, and it followed the game requirements to a T. At first glance it actually seemed like a flawless game, but it turns out the app was pretty buggy. I kept getting into an infinite loop with Officer Hardass, and the UI was also terrible. I tried to fix these issues with additional follow-up prompts, but they actually led to more hallucinations and more bugs, and it's pretty clear that this model isn't truly intelligent. That being said, though, there's
a huge amount of potential with this chain-of-thought approach. And by potential, I mean potential to overstate its capabilities. In 2019 they were telling us GPT-2 was too dangerous to release; now, five years later, you've got Sam Altman begging the feds to regulate it. Strawberry is scary stuff, but until proven otherwise, o1 is just another benign AI tool. It's basically just GPT-4 with the ability to recursively prompt itself. It's not fundamentally game-changing, but you really shouldn't listen to me. I'm just like a horse influencer in 1910 telling horses a car won't take your job, but another horse driving a car will.

This has been The Code Report. Thanks for watching, and I will see you in the next one.
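As an aside to the transcript: the "how many R's in strawberry" test that o1 reportedly failed is trivial to check deterministically. A one-line Python check (not from the video, just for reference):

```python
# Count the letter "r" in "strawberry" -- the question the video says still trips up LLMs.
word = "strawberry"
print(word.count("r"))  # → 3
```

Deterministic string counting is exactly the kind of task where a two-line script beats a billion-dollar model.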
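The matrix-transpose task from OpenAI's published chain-of-thought example (demonstrated in bash in their demo) is a standard exercise. A minimal Python sketch of the same transformation, for readers who want to see what the model was reasoning toward:

```python
def transpose(matrix):
    """Transpose a matrix given as a list of rows: output row i is input column i."""
    return [list(col) for col in zip(*matrix)]

# A 2x3 matrix becomes 3x2.
print(transpose([[1, 2, 3], [4, 5, 6]]))  # → [[1, 4], [2, 5], [3, 6]]
```

Note how the shape of the inputs and outputs (rows become columns) is the first thing to pin down, which is also the first step o1's chain of thought takes in the demo.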
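On the pricing point: hidden reasoning tokens are billed as output at the $60-per-million rate cited in the video, so costs add up quickly even though you never see those tokens. A rough back-of-the-envelope calculator (the token count is a hypothetical example, not a quoted figure):

```python
PRICE_PER_MILLION = 60.00  # USD per 1M output tokens, the o1 price quoted in the video

def output_cost(tokens: int) -> float:
    """USD cost for a number of billed output tokens (hidden reasoning tokens included)."""
    return tokens / 1_000_000 * PRICE_PER_MILLION

print(f"${output_cost(50_000):.2f}")  # 50k reasoning + answer tokens → $3.00
```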