Claude 3.7 SUPER CODER... With One Big Flaw?!


Matthew Berman

Claude 3.7 is here! It may be an excellent coder but it sure is lacking in other departments... Have...

Video Transcript:

Claude 3.7 Sonnet was just released, and I just got done testing it. I built a complex snake game that allowed two AI snakes to battle each other. I added a superfood that creates a block that can destroy one of the snakes, a block that actually moves and follows the snake around, and all of this was done on the very first try. I'm going to show you more about that later in the video, but now let me tell you a little bit about Claude 3.7 Sonnet.

So two things were actually released just now. We have Claude 3.7 Sonnet, which is a big, but still a dot, upgrade to the Claude series of models, and then we also have Claude Code, which is a command line interface for agentic coding.

Now, for Claude 3.7 Sonnet: it is a thinking model, and this is the first thinking model by Anthropic. I am pretty surprised that this is not Claude 4, and I find it a little bit weird that this jump is from 3.5 to 3.7 versus just straight to 4, which makes me think 4 is in the works and it's going to be much, much better. We don't know that for sure, but what we do know is that this minor version increment is a big jump.

This is the first hybrid reasoning model on the market. That means Claude 3.7 is both capable of generating near-instant replies to whatever prompt you have, in the more traditional LLM way, and it also has thinking, so it can take its time using chain of thought before replying to you, very similar to o1, o3, and Grok 3. But both of those modes come from a single model.

Now, just like other thinking models, Claude 3.7 has a scratch pad in which it's doing chain of thought. So it's actually iterating on its thinking, it's reflecting, it's trying different potential results, and then finally summarizing everything, or kind of choosing the best one, and then showing it to you. And they actually do show the chain of thought, which I thought was surprising, because Anthropic is kind of known for being really closed source and very big on security. Now, whether or not they're actually showing the true, full chain of thought, I'm not actually sure, but it kind of does look like they are.

And if you have API access, there's actually a dial with which you can tell Claude 3.7 how long to think for. You can actually specify the number of tokens, up to the context window maximum, which is 128,000 tokens, which is definitely on the smaller side of context windows. So as an API user, if you're building API applications and you're using Claude 3.7 Sonnet to power it all, you do want to specify how many tokens maximum, so that you just don't blow your budget overnight.

Let's look at some of these results. So this is SWE-bench Verified. Here is Claude 3.7 Sonnet; this is a 20% increase versus the other models listed here. This is Claude 3.5 Sonnet (new), o1, o3-mini (high), and DeepSeek R1. All four of these models come in right around 49%, and then with Claude 3.7 Sonnet we reach 70%. Now, there's a caveat here. This kind of lighter pink area says "with custom scaffolding". That just means they use customized chain-of-thought techniques and kind of wrappers around that to optimize for their specific model. And so without the custom scaffolding we're still getting a nice 12-plus percent increase in performance, but with that custom scaffolding we reach 70%.

Now, it's also really good at agentic tool use, and so that's what we're seeing here. Here's TAU-bench for retail and TAU-bench for airline. These are both real-world tasks where an agent is tasked with going and interacting with an API, like a retail API or an airline API. And as we can see here, Claude 3.7 Sonnet beats 3.5 and o1 in both instances, and so right now Claude 3.7 is state-of-the-art.

Now, on some of the more traditional benchmarks, although these are all very hard, we have GPQA Diamond, we have multilingual Q&A, visual reasoning, we have MATH 500, and AIME 2024. Claude 3.7 with extended thinking is very competitive with the top models out there, which include Grok 3 Beta and o3-mini with high thinking.

Now, these thinking models ace my rubric, and it's time to retire those tests officially. It's been a fun run, but they are retired, and we get to create a new one. Right now Alex and I are in the process of creating that new rubric, but in the meantime I'm going to try out a few new tests in this video to really push Claude 3.7 to the limit. And if you have any great suggestions for tests that I should put in this new rubric, let me know in the comments below.

All right, so this is Claude Code's research preview. It's actually really easy to install. I'll link to the installation instructions down below; it's like three steps. So I'm going to be honest: I tested Grok 3 in the middle of constructing this new rubric, and I didn't really push it as hard as I could have, and I know a lot of you mentioned that in the comments. So in this video I'm going to be comparing some of those tests to Grok 3 and o3-mini to see how they stack up.

So obviously Claude 3.7 can easily create the snake game. Here it is. It took just a number of seconds, really. It was very, very fast, and it works perfectly well. But that's not all. We're going to evolve it now.

All right, so the first thing is, let's just have AI control the snake itself. Let's see how easy it is to add that. So: "Make an AI that controls the snake." Now, the one thing that I don't like about this is the fact that you can't actually see what's happening as it's thinking or as it's writing code. Only at the very end, where you get the output, do you actually get to see the changes. And speaking of, here we go. So we can see all of the code was written, and we have snake_ai.py. Let's scroll to the bottom. "Do you want to create the game?" Yes, go ahead. All right, so now it's adding all of those changes into my code base, and it should be good to go. Yep, here we go. So we can toggle the AI on or off, increase the speed, decrease the speed, and let's give it a try. All right, there it is. The AI is now controlling it. See, I'm not doing anything. Speed, AI on. Pretty darn good. And it says it's using the A* algorithm to find the next piece of food. All right, there it made a mistake. Game over.

Let's keep adding to it now: "Now add the ability to add a second snake to the game, also controlled by AI." All right, here it goes. There it is, two snakes going at each other. Snake 2 wins. Let's try it again. All right, so I can already think of a few improvements to make here.

All right, next: "Add multiple pieces of food at a time, and add an occasional superfood that allows the snake that eats it to build a temporary 4x4 block that kills the other snake if it runs into it, but not the snake that created it. The superfood block should move slowly around the field for 7 seconds." All right, there we go, there's the superfood. Oh, that's so cool, look at that. That worked really well, actually. All right, let's play one more time. And so there it goes: the superfood kind of block is moving around, the two snakes are finding their own food still, and
there we go, Snake 2 wins. That was very impressive.

Now that we've seen what the coder can do, let's move on to Claude 3.7 Sonnet. Let's start with a really hard math problem. Let's see if Claude 3.7 can do it. I mean, this is just super impressive, that it can do all this notation really easily. So interestingly enough, Grok 3, who came up with this question, gave me -1 over 27, and then Claude 3.7 Sonnet gave me -1 over 9 for the integral. And so, you know, I'm confused which one's right. I checked with o3-mini, and it is also -1 over 9, so I assume Claude was right in this instance.

Now, here's the thing: you do need a paid account to use the extended thinking mode. So the previous math problem was not even in extended thinking mode, and it still got it. All right, let's give Claude 3.