So today Anthropic put the AI industry on hold when they announced their upgraded family of 3.5 models. You can see here it says: "Today, we're announcing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. The upgraded Claude 3.5 Sonnet delivers across-the-board improvements over its predecessor, with particularly significant gains in coding, an area where it already led the field. Claude 3.5 Haiku matches the performance of Claude 3 Opus, our prior largest model, on many evaluations for the same cost and similar speed to the previous generation of Haiku."

Now, there's a ton of announcements to get into, but the main things are going to be Claude 3.5 and, of course, their incredible new agentic computer use. It says: "We're also introducing a groundbreaking new capability in public beta: computer use. Available today on the API, developers can direct Claude to use computers the way people do, by looking directly at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. At this stage, it is still experimental, at times cumbersome and error-prone. We're releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time." So there's a lot to get through, but that's just a brief overview of what they've announced and released.

So, with the upgraded Claude 3.5 Sonnet, let's actually take a look at what Anthropic have released, because this is the model that most people are going to be excited about, considering it is the model that is leading the way in coding and many other areas. When we take a look at the Claude 3.5 Sonnet benchmarks, we can see that once again Anthropic have somehow managed to surpass every other state-of-the-art model in its class, namely GPT-4o and Gemini 1.5 Pro. We can see that Claude 3.5 is managing to excel in all domains. On graduate-level reasoning, GPQA Diamond, it manages to surpass GPT-4o and Google's Gemini 1.5 Pro. On MMLU Pro, we can also see that Claude 3.5 leads, gaining about 3 percentage points over its previous score. On the coding eval, Claude 3.5 Sonnet manages to do better again, taking it further than the GPT-4o result. And on the high school math competition benchmark, there is actually a rather large jump from the original Claude 3.5 Sonnet to the new Claude 3.5 Sonnet, almost double the performance, so it seems that Claude's mathematics has gotten a serious boost. On visual Q&A, MMMU, which measures visual understanding, the score has bumped up around 2 points to a level that is now state-of-the-art. One of the things Anthropic mentioned was that they wanted to improve the vision capabilities so that when you're feeding Claude screenshots so it can use your computer, the model doesn't struggle. So this is one of the things that is really incredible.

Now, there is a fascinating piece of the benchmarks in which there are only results for Claude models. I'm guessing that the other models haven't been tested on these capabilities because they might not have the capacity to even do these things. So let's take a look at what this says. On agentic coding, SWE-bench Verified, Claude 3.5 sets a really incredible standard for what the state of the art is. We can see that this model scores 49%, compared to 33.4% for the prior version of Claude 3.5 Sonnet. Now, I think this is rather incredible, and I want to show you all a screenshot that is potentially going to blow your mind. We can see here that Alex Albert says that the most significant gains are in coding: the new 3.5 Sonnet sets a new state of the art on SWE-bench Verified with a score of 49% using no complex scaffolding, besting all models, including reasoning models like OpenAI o1-preview and specialized models for agentic coding. I think this kind of achievement cannot be overstated, considering that this is a general-purpose model that also excels at a variety of different things and now, surprisingly, manages to even beat a reasoning model at coding. I'm sure OpenAI didn't think this would happen.

Now, one of the craziest things about this leaderboard is that the software engineering benchmark is constantly being broken, almost every single month. We can see that every month or so there's around a 5% gain on this benchmark, whether it comes from new scaffolding, which is essentially how you structure the models in some agentic form with tool use, or from an entirely new base model that is simply advanced enough to produce better responses and thus better results. We can see that Claude manages to best this, getting up to 49%. This is one of the things that I truly think is absolutely incredible, because it manages to even beat out many specialized models for coding. We can also see that for agentic tool use, on the airline benchmark, the new Claude 3.5 Sonnet is up at 46%, a 10-point gain on the original Claude 3.5 Sonnet's 36%.

Now, one of the things we can quickly take away from the Claude 3.5 Sonnet benchmarks is that when we look at the new areas of agentic coding and agentic tool use, other models aren't even on the radar. What I mean by this is that I think this agentic benchmark area is going to be the standard for new AI models. Oftentimes we see benchmarks like GPQA, MMLU, and math, but I think what's going to become more important as time goes on is how these models perform in agentic areas. As you all know, 2025 and the years that follow will likely be the years of agentic AI, so the models that are showing early promise in agentic capabilities now are going to be the models most loved by the AI community. So Claude 3.5 sets a new level that these other companies are probably going to have to match, which makes this an even more competitive race.

Now, what was also fascinating was that we did get a new model, which is called 3.5 Haiku. 3.5 Haiku replaces 3.0 Haiku as the fastest and least expensive model. Surprisingly, it outperforms many state-of-the-art models on coding tasks, including the original Claude 3.5 Sonnet, which is insane to say, and of course GPT-4o. So we have Claude 3.5 Haiku; although it is not available today, it will be made available in the coming weeks. But that model scoring 51.0% on the retail and 22.8% on the airline agentic tool use benchmarks is absolutely incredible, considering that this is going to be the fastest model and, of course, the least expensive model. So we're now basically getting an original-Claude-3.5-Sonnet-level model that seems to be faster than its predecessor and, of course, a lot cheaper. So for those of you who are developers trying to build certain applications where cost may have been a factor, Claude 3.5 Haiku might just be the saving grace for what you are trying to build, considering that the credits for this kind of model aren't going to be that expensive.

Now, of course, next we must cover these agentic capabilities and how they actually work. Anthropic have released a few videos, which I will showcase here so that you can understand exactly how well these models do when it comes to agentic use. Anthropic have stated that with computer use they're trying to build something fundamentally new. Instead of making specific tools to help Claude complete individual tasks, they're basically teaching it general computer skills, allowing it to use a wide range of standard tools and software programs designed for people. Developers can use this nascent capability to automate repetitive processes, build and test software, and conduct open-ended tasks like research. To make these general skills possible, they've built an API that allows Claude to perceive and interact with computer interfaces. Developers can integrate this API to enable Claude to translate instructions (for example, "use data from my computer and online to fill out this form") into computer commands (for example, check a spreadsheet, move the cursor to open a web browser, navigate to the relevant web pages, fill out a form with the data from those pages, and so on). On OSWorld, which evaluates a model's ability to use computers like people do, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category, which is notably better than the next-best AI system's score of 7.8%, and when afforded more steps to complete the same task, Claude scored 22%.

So in this first video, Claude is being shown automating operations.

I'm Sam, and I'm one of the researchers here at Anthropic. Computer use is something that we felt was going to be important for a while now, and so today we're going to be talking about the very early version we have of computer use and walking through a representative example of the things we think it's going to be useful for. We're going to be going through a quick demo today. In this fictional demo, a customer, in this case the Ant Equipment Company, has come to us and asked us to fill out a vendor request form. The data I need to fill out this form is scattered in various places on my computer. What we're going to do is ask Claude to look at the spreadsheet, check if Ant Equipment is in there, and if not, move over to the CRM and try to find some more information there. Once it has this data, Claude's going to then fill out the form for us and hopefully transfer the information across to the vendor form. The first thing that's going to happen is Claude's going to start taking screenshots of my screen, and it quickly realizes that the Ant Equipment Company isn't actually in the spreadsheet. So the first thing it does is swap over to the CRM and search for the company we're interested in. Luckily, we get a search match, and Claude then starts scrolling through the page, looking for all the information it needs to fill out this form. Claude then autonomously starts transferring the information across without me having to do anything, goes through the steps, fills out all the information needed, and then submits the form. This example is representative of a lot of drudge work that people have to do. This is available in the API. We're excited for people to try it, and we should expect things to get a lot better over the coming months.
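To make the loop in that demo concrete (take a screenshot, decide on one action, carry it out, repeat), here is a minimal Python sketch. It is a simulation under stated assumptions, not Anthropic's API: `FakeDesktop`, `Action`, `choose_action`, and `run_agent` are all hypothetical names, the "screenshot" is a plain dictionary rather than pixels, and a hard-coded rule stands in for the model. The company records are made-up placeholder data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    kind: str          # "open_app", "search", "type", "submit", or "done"
    target: str = ""   # app name or form field, depending on kind
    text: str = ""     # search query or text to type


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for a real screen: a spreadsheet, a CRM, a form."""
    spreadsheet: dict = field(default_factory=dict)
    crm: dict = field(default_factory=dict)
    form: dict = field(default_factory=dict)
    active_app: str = "spreadsheet"
    search_result: Optional[dict] = None
    submitted: bool = False

    def _source(self) -> dict:
        return self.spreadsheet if self.active_app == "spreadsheet" else self.crm

    def screenshot(self) -> dict:
        """Return what is currently visible (a dict instead of pixels)."""
        return {
            "active_app": self.active_app,
            "visible_rows": list(self._source()),
            "search_result": self.search_result,
            "form": dict(self.form),
            "submitted": self.submitted,
        }

    def execute(self, action: Action) -> None:
        """Apply one action to the simulated desktop."""
        if action.kind == "open_app":
            self.active_app = action.target
        elif action.kind == "search":
            self.search_result = self._source().get(action.text)
        elif action.kind == "type":
            self.form[action.target] = action.text
        elif action.kind == "submit":
            self.submitted = True


def choose_action(obs: dict, company: str, fields: list) -> Action:
    """Rule-based stub for the model: map the latest observation to one action."""
    if obs["submitted"]:
        return Action("done")
    record = obs["search_result"]
    if record is None:
        # Not in the spreadsheet? Swap over to the CRM, as in the demo.
        if obs["active_app"] == "spreadsheet" and company not in obs["visible_rows"]:
            return Action("open_app", target="crm")
        return Action("search", text=company)
    for name in fields:
        if name not in obs["form"]:
            return Action("type", target=name, text=record[name])
    return Action("submit")


def run_agent(desktop: FakeDesktop, company: str, fields: list, max_steps: int = 20) -> list:
    """Observe, decide, act, repeat; max_steps caps runs that never finish."""
    log = []
    for _ in range(max_steps):
        action = choose_action(desktop.screenshot(), company, fields)
        if action.kind == "done":
            break
        desktop.execute(action)
        log.append(action.kind)
    return log


if __name__ == "__main__":
    desk = FakeDesktop(
        spreadsheet={"Other Co": {"contact": "A", "email": "a@example.com"}},
        crm={"Ant Equipment": {"contact": "B", "email": "b@example.com"}},
    )
    print(run_agent(desk, "Ant Equipment", ["contact", "email"]))
    print(desk.form)
```

The step cap matters because, as the announcement says, the capability is still error-prone: if the company is in neither place, this stub keeps searching until `max_steps` runs out rather than looping forever.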
Next we have an example of Claude 3.5 Sonnet being used to control the computer and carry out a few coding tasks on a website.

I'm Alex. I lead developer relations at Anthropic, and today I'm going to be showing you a coding task with computer use. So we're going to be showing Claude doing a website coding task by actually controlling my laptop. But before we start coding, we need an actual website for Claude to make changes to. So let's ask Claude to navigate to claude.ai within my Chrome browser and ask Claude within claude.ai to create a fun '90s-themed personal homepage for itself. Claude opens Chrome, searches for claude.ai, and then types in a prompt asking the other Claude to create a personal homepage for itself. claude.ai