The Best AI Model Just Got a Big Upgrade - New Claude 3.5 Sonnet and Haiku

Skill Leap AI

Anthropic has introduced two new AI models: the upgraded Claude 3.5 Sonnet and a new version, Claude...

Video Transcript:

Claude just introduced two new AI large language models. One of them is Claude 3.5 Sonnet, but this is a brand new version of it with much better reasoning and coding abilities. They also introduced a brand new Claude 3.5 Haiku. So far we had Claude 3 Haiku, so this is a lighter-weight version of it. And they also introduced a brand new way to use AI called computer use. They have a really nice two-minute demo that I'm going to show you. This is extremely powerful, and it's available today in their API.

Claude 3.5 Sonnet, the new version of it, was my favorite model. I think it's the best model available right now, but it got an even bigger upgrade, and it's available to use right now. I'm going to show you this with some reasoning and coding examples in this video.

If you haven't used Claude before, I just want to show you one thing really quick about its popularity versus ChatGPT. This is the Claude website, claude.ai. It's getting 70 million visits a month. Okay, so that's pretty good. But let me go to chatgpt.com and show you the same analytics: 3.1 billion visits a month. So even though Claude has been beating ChatGPT for a while now (3.5 Sonnet was already the leading large language model compared to GPT-4o, it was winning in every benchmark, and now they have an upgraded version), ChatGPT is still completely dominating at three billion visits a month compared to 70 million visits a month. That's kind of shocking to me. I use Claude way more than I use ChatGPT.

Okay, now back to this. They also introduced something called computer use. This is available in public beta, an experimental option available inside the API, so it's really meant for developers. It can direct Claude to use the computer the way people do: by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. They have this demo right here, and I'll show you this in a second, and then we'll jump in and test out Claude.

I'll show you their benchmarks. Obviously, every time we have benchmarks, we've got to test it for ourselves to see if it keeps up. But look at this: Claude 3.5 Sonnet, this new version of it, and this is the Haiku model, compared to the best models available right now. You'll see o1, by the way, is missing from this (that's the reasoning model from OpenAI) because it's pretty limited in usage right now, so these are more for everyday use. But I wish they did compare this to the o1 model, because they're saying Claude 3.5 Sonnet is a better reasoning model. Well, GPT-4o is not the reasoning model from OpenAI; the o1 model is, and that's not in this benchmark, so keep that in mind. But compared to GPT-4o and compared to Gemini 1.5 Pro, which is the best version of Google Gemini, it's completely beating these models.

If you go to claude.ai, you'll see a new model here. This is Claude 3.5 Sonnet, and right now you can choose it here. Claude 3.5 Haiku is going to be rolling out in the next few weeks, so it's not available right now. But this is obviously their best model, and it's going to beat Claude 3.5 Haiku too, so this is the one we want to go ahead and test out. I have some new prompts here I want to test it with, and we're going to test out some coding. Before that, let me just show you this demo right here. This is for computer use, available inside the API, so this is really meant for developers. But as a non-developer, the power of this kind of blew my mind. I think this is going to be a huge leap for AI once this becomes kind of mainstream.

So I'm Sam, and I'm one of the researchers here at Anthropic. Computer use is something that we felt was going to be important for a while now. So today we're going to be talking about a very early version we have of computer use and walking through a representative example of the things we think it's going to be useful for. We're going to be going through a quick demo today. In this fictional demo, a customer, in this case the Ant Equipment Company, has come to us and asked us to fill out a vendor request form. The data I need to fill out this form is scattered in various places on my computer. What we're going to do is ask Claude to look at the spreadsheet, check if Ant Equipment is in there, and if not, move over to the CRM and try and find some more information there. Once it has this data, Claude's going to then fill out the form for us and hopefully transfer the information across to the vendor form.

The first thing that's going to happen is Claude's going to start taking screenshots of my screen, and it quickly realizes that the Ant Equipment Company isn't actually in the spreadsheet. So the first thing it does is swap over to the CRM and search for the company we're interested in. Luckily we get a search match, and Claude then starts scrolling through the page looking for all the information it needs to fill out this form. Claude then autonomously starts transferring information across without me having to do anything, goes through the steps, fills out all the information needed, and then submits the form. This example is representative of a lot of
drudge work that people have to do. This is available in the API. We're excited for people to try it, and we should expect things to get a lot better over the coming months.

Now, that's the AI that I think we've been promised: AI agents that can do things for you on your computer. It opened up different pages, it opened up different spreadsheets, it scrolled through different pages, it typed out text, it filled out a form. It successfully completed an entire task, right? So that is where AI agents are going to lead us. Right now, this is one of the best use cases I've seen for it. Obviously it's an early demo, an experimental beta, but computer use could be really powerful. So if you're a developer, let me know what you think of it if you're testing it out. It's available right now in the API. I'll dig into it a little bit, but I'll save that for a different video.

Right now I want to test out Claude 3.5 Sonnet, the new version of it. Let's go ahead and do some reasoning and logical testing. For writing and everything else, I already use Claude, so I've already compared that with GPT-4o, and it's my preferred model for summarizing text, for writing articles, for all kinds of different use cases, marketing. But right now, for logical reasoning, o1 is obviously the best model, so let's see if we can get some of the same results. I tested o1 in detail in a different video. I'm going to use some of the early reasoning prompts from that video, but I also have some brand new prompts and riddles and things like that I want to test out with this new model.

How many Rs in "strawberry"? That's something that large language models struggle with. Oh, this is interesting, the way it laid it out. There are three Rs in "strawberry", which is right, so pass there. Okay, I'll start a new chat. Which number is bigger, 9.11 or 9.9? The o1 model gets this one right. Let's see. So 9.9 is greater than 9.11. Okay, good, it got that right, and it kind of walks you through all the steps that it needed to take to get that answer.

Now the next one: what comes once in a minute, twice in a moment, but never in a thousand years? The letter M appears once here, twice here, and never in "a thousand years". It got that right.

Two fathers and two sons went fishing. Each caught one fish, but when they returned they only had three fish. How is this possible? Okay, let's see what we get there. "Let me think through this step by step." So it's kind of using similar chain-of-thought prompting to what the o1 model uses in the background, but it spells it out for you. It says it seems like there are four people at first, two sons and two fathers, but it breaks down how that's not the case. There's one grandfather, so that's one father; then he has a son, who is also a father and a son, who caught one fish; and the grandson, who is just one son, caught one fish. So the answer is right, and it walked through the reasoning here, and it makes sense.

Let's try a more challenging one. You're given 25 horses and a racetrack where only five horses can race at any given time. You do not have a stopwatch, and you need to find the three fastest horses. What's the minimum number of races that you need to run to determine the top three horses? Now again, this is using the same step-by-step, that same chain-of-thought prompting. It looks like this is how it's trying to get to that answer; it's walking us through all the steps. And I know the answer is seven. Okay, wow, it did get that right, and it spelled out all the different steps it needed to walk through in order to get there. That's impressive; it got that one right too.

I want to see if it knows how to count words, so I'm just going to give it one more riddle. What came first, the chicken or the egg? Now, the scientific answer is actually quite clear: the egg came first. The first chicken egg was laid by
a bird that was almost, but not quite, a chicken. Good answer. Now let's see: how many words in this response? Okay, this is something brand new. I've never seen any AI model answer like this, but look how it's counting: it's just counting one word at a time, like this. So let me just copy and paste this right here into Microsoft Word and see the actual word count. This one said 88, and Microsoft Word says 107. I wonder why it didn't get that right. Okay, I just manually counted and I also got 107, so Microsoft Word is accurate. But I don't know exactly what went wrong here, because it looks like it went all the way down to the very end, behind this right here is the end, and it started with "This is a fascinating". Okay, so not sure, but it did not get the counting right, just like every other model hasn't been able to get the counting of words right.

Let's try a couple of different coding examples. One I run a lot of times, just to keep it consistent across different models, is a game of checkers. I just say: write a game of checkers in Python that I could run on my Mac. Okay, the first version it gave me just ran inside of this little terminal app, so that's not what I wanted. I've noticed that different models sometimes give you this version if you don't really spell out that you want it as a standalone app, and this one gave me this the very first time. So I just asked it to give me this as a standalone app, and it gave me new code. Okay, let's see how this one works. Let's move the red piece. Oh, that's not right. I have not even played and it's already not functioning. Okay, let me just try one more follow-up prompt. It looks nice, but it's obviously not working, which is not great. I usually get a much better result on the very first prompt out of GPT-4o. I'm going to say the game logic is broken and get some new code. Okay, let's try this again. Red's turn. Let's click there, let's move this guy. Okay, now it seems to work. Oh, nope, I still can't take a piece. So the only thing
that it did was get the starting right, but it doesn't seem to be able to take a piece on either side. So that was three different prompts, which I really haven't needed to give GPT models. I think I even got the new Llama model to take it a step further here. So not quite what I needed to get out of this example.

Let's try one more. This one I won't tell it to write in Python: write code for a game of Tetris I can run on my Mac as a standalone app. Let's see if we get something useful here. And by the way, if you're a developer, let me know in the comments section if you have a very specific example of a prompt I could use to test coding for an upcoming thing, and what I should look for exactly, and I'll try it for my next comparison video.

Okay, this seems to be working really well, actually. Let me just fast forward a bit. Okay, let me see if it can actually clear a level here. All right, perfect, so this seems to be working really well. Let me see what happens if you lose the game. Okay, perfect: game over, exactly how it should end. So it's a win on the Tetris side, but checkers was a fail, and these days I can get pretty much any model to take that a little bit further. So we'll go ahead and test this out a little bit more, and I'm going to dive into computer use a little bit too. Hopefully I'll have a video about that coming up. But let me know what you think of this new model, and let me know if you like the GPT models more or the Claude models more. Claude 3.
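Several of the reasoning prompts tested in the video have answers that can be checked mechanically. The Python sketch below is one rough way to do that, not anything shown in the video. The helper names (`count_letter`, `larger_decimal`, `count_words`) are hypothetical and introduced only for illustration. The word counter simply splits on whitespace, which is an assumption about how a word processor counts and may differ slightly from Microsoft Word on punctuation or hyphenated terms.

```python
from decimal import Decimal


def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in text."""
    return text.lower().count(letter.lower())


def larger_decimal(a: str, b: str) -> str:
    """Return whichever of two decimal strings is numerically larger."""
    # Decimal compares by numeric value, so "9.9" is treated as 9.90, not as "9 point nine < 9 point eleven".
    return a if Decimal(a) > Decimal(b) else b


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (a rough stand-in for a word processor's count)."""
    return len(text.split())


if __name__ == "__main__":
    # "How many Rs in strawberry?"
    print(count_letter("strawberry", "r"))
    # "Which number is bigger, 9.11 or 9.9?"
    print(larger_decimal("9.11", "9.9"))
    # "Once in a minute, twice in a moment, never in a thousand years"
    print([count_letter(p, "m") for p in ("minute", "moment", "a thousand years")])
    # Word count of a short sample sentence
    print(count_words("The egg came first."))
```

Pasting a model's full response into `count_words` gives an independent count to set against the number the model reports, similar to the Microsoft Word check in the video.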