HUGE Claude updates! Real AI agents and better Sonnet model

10.21k views

AI Search

Anthropic updates: Computer Use installation & testing. Claude 3.5 Sonnet New testing & review. #ain...

Video Transcript:

We have some huge updates from Anthropic. They've just released some game-changing AI agents that you can use right now, so today I'm going to go over how to set this up and use it on your computer. They've also released a new and improved version of Claude 3.5 Sonnet, so I'm also going to test out its limits and show you what it can and cannot do. Let's dive right in.

So they've just announced that they have this new feature called computer use, which I think is a game changer. It's basically an AI agent that scans your computer and moves your mouse around, does actions on your keyboard, and basically interacts with the buttons on your interface to find information for you. So here's their demonstration of this computer use feature.

So I'm Sam, and I'm one of the researchers here at Anthropic. Computer use is something that we felt was going to be important for a while now, so today we're going to be talking about a very early version we have of computer use and talking through a representative example of the things we think it's going to be useful for. We're going to be going through a quick demo today. In this fictional demo, a customer, in this case the Ant Equipment Company, has come to us and asked us to fill out a vendor request form. The data I need to fill out this form is scattered in various places on my computer. What we're going to do is ask Claude to look at the spreadsheet, check if Ant Equipment is in there, and if not, move over to the CRM and try and find some more information there. Once it has this data, Claude is going to then fill out the form for us and hopefully transfer the information across to the vendor form.

The first thing that's going to happen is Claude's going to start taking screenshots of my screen, and it quickly realizes that the Ant Equipment Company isn't actually in the spreadsheet. So the first thing it does is it swaps over to a CRM and searches for the company we're interested in. Luckily we get a search match, and Claude then starts scrolling through the page, looking
for all the information it needs to fill out this form, then autonomously starts transferring the information across without me having to do anything, and goes through the steps and fills out all the information needed and then submits the form. This example is representative of a lot of drudge work that people have to do. This is available in the API. We're excited for people to try it, and we should expect things to get a lot better over the coming months.

Here's another demo. So we're going to be showing Claude doing a website coding task by actually controlling my laptop. But before we start coding, we need an actual website for Claude to make changes to, so let's ask Claude to navigate to claude.ai within my Chrome browser and ask Claude within claude.ai to create a fun '90s-themed personal homepage for itself. Claude opens Chrome, searches for claude.ai, and then types in a prompt asking the other Claude to create a personal homepage for itself. Claude.ai returns some code, and that gets nicely rendered in an artifact on the right-hand side.

That looks great, but I want to make a few changes to the website locally on my own computer. Let's ask Claude to download the file and then open it up in VS Code. Claude clicks the save to file button, opens up VS Code, and then finds the file within my downloads folder and opens it up. Perfect. Now that the file is up and running, let's ask Claude to start up a server so that we can actually view the file within our browser. Claude opens up the VS Code terminal and tries to start a server, but it hits an error: we don't actually have Python installed on our machine. But that's all right, because Claude realizes this by looking at the terminal output and then tries again with Python 3, which we do have installed on our machine. That works, so now the server is up and running.

Now that we have the local server started, we can go manually take a look at the website within the browser, and it looks pretty good. But I noticed that there's an error in the terminal output, and we also have this missing file icon at the top here. Let's ask Claude to identify this error and then fix it within the file. Claude visually reads the terminal output and then opens up the find and replace tool in VS Code to find the line that's throwing the actual error. In this case, we just ask Claude to get rid of the error entirely, so it will just delete the whole line. Then Claude will save the file and automatically rerun the website. So now that the error is gone, let's go take a final look at our website, and we can see that the file icon has disappeared and the error is gone as well. Perfect. So that's coding with computer use and Claude. This took a few prompts now, but we can imagine in the future that Claude will be able to do tasks like
this end to end.

Here's another demo. I'm going to show you a simple example of computer use today. My friend's coming to San Francisco next week, and I want to take him to do some touristy stuff. I think doing a sunrise hike with a view of the Golden Gate Bridge never gets old, so I'll ask Claude to figure out some logistics for us. I'll ask Claude to find a good place to see the sunrise, to help me figure out timing logistics, and help drop a calendar invite so I remember when I have to leave. It's opening Chrome, going to Google, searching, and it looks like it's found something. Great. So how far away is the location from my place? It's opening Maps, searching for the distance between my area and the hiking location. Cool. So now it looks like Claude is searching for the sunrise time tomorrow and is now dropping it into my calendar and populating it with some details. And great, it looks like Claude did it. This is a simple example, but we're sharing computer use early to learn from what people build.

This is actually what I've been waiting for. This is truly an AI agent, or at least the start of one, because previous so-called agents don't actually look at your computer screen and then move the mouse or keyboard accordingly. They just use some Python or JavaScript code to execute key presses or button presses or extract data from a site, but this is all programmatic. It's not really looking at a screen and moving your mouse around or pressing keys on a keyboard. But we humans just look at a screen and try to figure out where things are on the screen and then click accordingly, and this computer use does exactly that.

Now, they mentioned that this computer use feature is available today on the API, so developers can tap into this feature right away. So developers can direct Claude to use computers the way people do, by looking at a screen, moving a cursor, clicking buttons, and typing text. However, at this stage it is still experimental, at times cumbersome and error prone. In fact, if you scroll down a
bit, here is a very key metric. So on OSWorld, which is a benchmark that evaluates how these AI agents actually perform tasks on computers the way humans do, they found that Claude 3.5 Sonnet scored 14.9% in the screenshot-only category. It's still better than the next best system out there, which only had a score of 7.8%, but basically this means that it was only able to complete around 15% of tasks so far, not even over 50%. But it is a good start.

All right, let's actually try to get this computer use feature up and running. So currently it's only available via their API, so there are some technical steps involved, but I'm going to walk you through all of it in this video. So in console.anthropic.com, log in or sign up for a free account, and then once you're in, you should see Get API Keys. I do not have a key yet, so I will click on Create Key, and then I'm going to select the workspace. I'll just call this comp use and then click Add. All right, so here is my secret key. I will copy this and save it somewhere. Do not expose this key to anyone. I'm going to delete this key before I publish the video. All right, I'm going to click Close now.

And then the next step is we need to go to this GitHub page, which I'll link to in the description below, and all you need to do is run this code in your command prompt. Now, this is using Docker, and I'm using Windows, so we need to first install Docker if you haven't already. So first of all, let's open up PowerShell, and I'm going to open this as administrator, and I'm going to type in wsl --install. So this will basically install the Windows Subsystem for Linux onto your system. Now, I already have this installed, so it's going to say this is already installed.

All right, the next step is to download Docker. I'm also going to link to this page in the description below, but when you hover over this button, just click on whichever operating system you're using. I am using Windows x64, so that's basically the same as AMD64, so I will click on this one. So after that is completed, let's open up the exe file and let's go through the installation. So I'm going to click OK. It's now proceeding to unpack the files and install everything on our computer. All right, so after our installation has succeeded, if you open up Command Prompt and you type in docker --version and you see some Docker version, then it signifies that you have successfully installed Docker.

All right, so first thing is we need to run Docker Desktop first, so I'm going to click Run as administrator, and then I'm going to click Accept and then Finish, and then I'm just going to go through these setup steps, and now it's starting the Docker engine. All right, so if all
goes well, you should see down here Engine running. So you do need to have Docker running before we can proceed to the next step.

Anyways, the next step is going back to this GitHub page by Anthropic. All we need to do is run all these lines of code in our command prompt, so let me open up Command Prompt, and I will click Run as administrator. And then we first need to import our Anthropic API key. Now, because I'm using Windows, there are some minor changes we need to make to this code. It's not going to be export but instead set, so I'm going to write set and then ANTHROPIC_API_KEY equals, and then I'm going to paste in the API key that we copied previously. And after that's done, the next step is to run this. Now again, because I'm using Windows, there are going to be some minor changes. We need to replace these backslashes with an up arrow (the caret character, ^), so instead of this, at least for Windows, you should replace the backslashes with these up arrows. Anyways, let's press Enter, and if all goes well, it should start downloading from this repo up here. So it's not too much to install, just around like 1 GB. I'm going to pause this video and come back when it's done.

All right, so if it's successfully installed everything, you should then see this message: computer use demo is ready, open up this URL in your browser to begin. So let me copy this, and then I'm going to paste it in Chrome, and voila, we have the computer use demo up and running. So here is where you can chat with it like a chatbot, and to the right here is a Linux operating system.

So let's try this out. I'm going to write: go to my YouTube channel, find the 10 most recent videos, and add them to a spreadsheet with columns title, views, duration. Let's see if it can pull this off. And I'm getting this message because I have not added any credits yet, so let's add some credits here. All right, so I've just added $5 to my balance. Let's open this up again. I'm going to copy the same prompt, paste it in here, and then click Run. All right, perfect. So you can see it's
running the agent. First it's going to open Firefox and then navigate to the channel. Notice I'm not moving the screen. It's doing this all by itself, which is pretty insane. Let's see what it does next. So okay, and then it looks like it's going to install LibreOffice next to create the spreadsheet, and then it's running into some errors. So again, as I mentioned, it can only complete like 15% of tasks so far, so this isn't really that robust yet.

I'm going to help it out a bit here and open a spreadsheet first. So let me click on LibreOffice to open a spreadsheet and then click OK, and then I'm going to use this prompt: go to my YouTube channel, click on the Videos tab, and then find the 10 most recent videos, and then add each one to the spreadsheet. Let's see if it can automate this. All right, so next let's click on Firefox. Now it's opening up my YouTube channel. Can it go to the Videos tab? Yes, it did click on the Videos tab, so now hopefully it will find the most recent 10 videos. All right, it's opening the spreadsheet now. What is it trying to do? Okay, all right. Okay, so yeah, you can see now it's entering my videos, including the title, the view count, and the duration. Very impressive. And again, note that it's doing all of this automatically. I'm not controlling the screen. All right, so it looks like it's now fetching the next set of videos, because I specified the 10 most recent videos, so right now it's taking a screenshot of the next set of videos and then trying to interpret and extrapolate the title, view count, and duration. Perfect, perfect. I think we're done now, so I'm just going to click Stop here at the top.

And then going back to our spreadsheet, let's compare this to the ground truth. So if I scroll back to my prompt, first of all, it did not add the columns title, views, and duration here, so it got this incorrect. And then I actually don't know where this is coming from. So if you look at my most recent video, it should start with Real-time AI video games, with 49k views, so this should
actually be number one, so this row seems incorrect. This is number one. It did register 49k views, and the duration is 31:46, which is what we see here. Perfect. Next up, This free AI text to speech, 60k views, 44:44, and that is indeed video number two. And then video number three, AI video just got better, 54k views, 29:05. That is exactly what we have here. Now, there is a small formatting problem here. Let's see which video this is. New open source AI video, so it's this one. The duration should be 14:56, but somehow it got 2:56 over here, so that's incorrect. And then same with this one, This free AI can control anyone's face: the duration should be 22:27, but it got 10:27 here for some reason. So overall it's still very error prone. As I've mentioned before, it only works for like 15% of use cases. But I mean, this is just the start. I think you can use this to automate a ton of tasks in the future, from extracting data, compiling spreadsheets, replying to emails, and doing cold outreach and sales. I mean, once this improves and gets less error prone, it can make you just infinitely more productive.

Thanks to Aithor for sponsoring this video. Aithor is a powerful AI assistant designed to enhance academic and creative writing. This versatile tool addresses common challenges faced by writers. One of Aithor's standout features is that it can automatically add references to your text following standard citation formats like APA and MLA. This feature saves a ton of time. Also, Aithor can help generate topics and a table of contents, which can provide you with inspiration and help fight writer's block. And if you're concerned about the authenticity of AI-generated content, Aithor has got you covered. They have a built-in AI content detector, which helps you assess the likelihood of text being identified as AI generated. Plus, they have an AI disguise tool, which you can use to refine your text and make it sound more human and less AI. Now of course, each person has their own unique writing style, so wouldn't it be great if there was an AI
that could actually write in your style? Well, Aithor also has a personalization feature, which helps you adjust generated text to match your own writing style. All of these features work together to help you write content that is natural, authentic, and accurate. Visit Aithor via the link in the description below and see how it can supercharge your writing. Unlock more features using my code AI search 10 to receive 10% off.

In addition to this computer use agentic feature, they also announced an upgraded Claude 3.5 Sonnet and 3.5 Haiku. Now, if you're familiar with the Claude family of AI models, at least in version 3, they had three model names: Haiku, Sonnet, and Opus. Haiku was the smallest model, so it's the least performant and quote-unquote intelligent, but it runs faster. And then we had Sonnet, which is the medium-sized model. It's slightly more performant, but the cost is higher and also it takes longer to run. And then finally, the most intelligent is Opus, but of course it has the most parameters, it takes the longest, and it costs the most to run. Now for the 3.5 family, looks like at least for today they've released the Haiku version, but for the biggest model, Claude 3.5 Opus, there are no signs of that, so slight disappointment there.

Anyways, let's look at the benchmark metrics of both this new Claude 3.5 Sonnet and Claude 3.5 Haiku. So I'll link to this page in the description below, but here they list several benchmark metrics, and here is the new Claude 3.5 Sonnet, which is just called, well, Claude 3.5 Sonnet (new). And you know, the naming convention of these AI companies just really confuses me. For example, earlier this year Google also announced their new Gemini 1.5 Pro, which is called Gemini 1.5 Pro 002. I mean, why not call this Gemini 1.6 Pro? And then here, similar with Anthropic's Claude 3.5 Sonnet, if they've released a new version, why not call this Claude 3.6 Sonnet? I have no idea why. It's just what it is.

Anyways, this first column on the left is the newest Sonnet model, and then this third column here is the older Sonnet model, and then here we have GPT-4o. However, it does not mention which release of GPT-4o they're talking about, so you also have to take this with a grain of salt, because if you look at the LMSYS leaderboard, there are actually multiple versions of GPT-4o. There's the September version, and then there's the May version, and then there's also an August version down here, which surprisingly has a lower Elo score than the May version. Anyways, basically there are different versions of GPT-4o, and it doesn't specify here which version they're referring to, so also take this comparison with a grain of salt.

Anyways, if you compare this new 3.5 Sonnet with the older one, you can see for graduate-level reasoning it has gotten over 5% better, and then for MMLU-Pro, which is undergrad-level knowledge, it also performs 3% better than the previous generation. However, for coding, interestingly, this was only a 1.7% increase compared to the previous version, so just a very marginal increase, nothing huge. And then for math, it got like 7% better. However, interestingly, Gemini 1.5 Pro is actually the best at math problem solving. And then across all these other benchmarks, it seems to also outperform these other models. However, these are just their numbers. I would really like to see an independent evaluator rank this new 3.5 Sonnet among these other models.

So right now I'm looking at a few leaderboards, like LiveBench by Abacus AI, and I don't see the newest Claude 3.5 Sonnet here yet. I'm also looking at Scale's SEAL leaderboard, and I also don't see the newest Claude 3.5 Sonnet here yet. I mean, they've just released this a few hours ago, at least at the time of this recording, so I don't expect them to list this on here that fast. Same with the leaderboard from Artificial Analysis: I don't see Claude 3.5 Sonnet (new) yet. So anyways, I'll keep you posted in a future video once I do see these benchmarks, so you can get a sense of how this new model compares to all the other state-of-the-art models like GPT-4o and Gemini 1.5 Pro.

Again, one thing to note from this announcement is that their biggest model in the family, Opus, has not been released yet. There's actually no mention of Claude 3.5 Opus anywhere in their documentation. So I believe before they had a Claude 3.5 Opus coming soon mentioned somewhere on this page, but they completely took down any mentions of this right now, so I'm not sure if there will be a 3.5 Opus model.

Anyways, enough talking about benchmarks. Let's actually jump in and test out this new 3.5 Sonnet. So if you go to claude.ai, you should see that starting today you'll have access to this new model. So let's test it out with some challenging prompts. The first one is: make the game Tetris using Python. So far none of the other state-of-the-art models could actually get this in one go. I had to prompt them at least two or three more times to actually get a working game of Tetris. All right, so here's the code, and you know what's cool is it asked if I want to implement these additional features, like high score tracking, different difficulty levels, sound effects, more visual effects. That would be really neat, but let me just copy this code first and then paste it in an empty Python file. I will click Save, and then I will run this. Let's see if it works. Uh-huh, and we get an error. So even the newest Claude 3.5 Sonnet is unable to build me a fully functional Tetris game in one go.

All right, so I'm going to just copy the error message, paste it in here, and press Enter. So let's wait for this to generate. All right, let's copy this new code, and I'm going to select all, paste everything in here, click Save, and then let's run the code. All right, finally this works. And yes, if I press the up arrow, it does rotate the piece. Let me move this down. Wow, it even shows the next piece in the sequence. That is awesome. All right, let's see. I'm going to attempt to make a new line here, and yes, it disappears. I'm going to now try to kill myself by reaching the top of the page, and let's see if it gives me a game over message. All right, last piece, last piece. Ah, game over, press Enter to restart. Perfect. So in just one additional prompt, it was able to give me a fully functional Tetris game. I think even with GPT-4o or even o1, it took like one or two additional prompts for it to get this correct, so this is not bad.

By the way, there's a lot of cool stuff you can do with Claude 3.5, like getting it to duplicate the design of a website, generating reports for you, building 3D games, building an audio visualizer. There's a lot of cool things you can do, so check out this video if you haven't already. In this video, I'm really trying to test harder prompts to test its quote-unquote intelligence.

All right, next one. I'm going to start a new chat, and then I'm going to enter this prompt: John is twice as old as Mark. In 5 years, the sum of their ages will be 65. How old is each person now? Let's click Enter to generate this. Perfect. Now, this question is tricky because their ages aren't actually whole numbers. If you just use whole numbers, as it says here, this wouldn't actually satisfy the conditions given. So it got the answer correct. It gave me the decimal number of their age.

All right, let me start a new chat, and I'm going to paste in this just for fun. I'm sure most language models will be able to get this correct now. Oh my god, really? That is ridiculous. I mean, this question, how many R's are in strawberry, this is a very popular question already, and I would expect, you know, the developers of these large language models to at least train a model that would get this question correct, because they know that everyone is going to test these new models on this question. So I'm really surprised that it still did not get this correct. So: let me count carefully by marking each R. So here's an R here, and oh, it looks like it skipped this one, and then here's the other R here, so: there are two R's in strawberry. That's pretty ridiculous to think about, how the best version of Claude right now still cannot count the number of R's in strawberry.

All right, let me start a new chat, and here's another trick prompt that seems to throw a lot of language models off: which number is bigger, 9.11 or 9.9? All right, perfect, so it got this one correct. Indeed, 9.9 is bigger than 9.11.

All right, here's another question that throws a lot of language models off: a farmer and a sheep are standing on one side of a river. There is a boat with enough room for one farmer and one sheep. How can the farmer get across the river with the sheep in the fewest number of trips? Let's see what this gives us. And it nailed it. So yes, the farmer and the sheep can get in the boat and cross together, so it only takes one trip for both of them to get across the river. There's no need for multiple trips back and forth. Perfect.

All right, next prompt is to test if it hallucinates. So here the prompt is: tell me about Stable Diffusion 5 in a short paragraph. Now, Stable Diffusion 5 does not exist, so I would expect it to answer this does not exist instead of just making up an imaginary description about Stable Diffusion 5. So let's see what we get. Okay, this is kind of acceptable. So it says, I should note that since my knowledge cutoff is April 2024, I may not have fully accurate information about SD 5's final release or capabilities, which doesn't exist, by the way. We're not even at SD 4 yet. What I can say is that SD 5 is a highly anticipated release from Stability AI, expected to bring significant improvements in image generation quality, etc., etc. This part is already wrong. So they have not announced anything about a fifth version of Stable Diffusion, so it's just making up all this stuff here.

All right, let me start a new chat, and then let me test its, like, reasoning and planning capabilities. So the prompt is: you are tasked with organizing a week-long summer camp for teenagers. The camp includes various activities such as sports, arts and crafts, and educational workshops. How would you create a schedule that balances these activities, keeps the campers engaged, and ensures their safety and well-being? So let's press Enter and see what we get. And note that there is no correct answer for this. I'm just trying to see
how it would plan this out and if its output actually makes sense. All right, so I like this artifacts feature again, where it kind of outputs its answer in this separate window, so you can inspect the message and the output at the same time. Anyways, let's scroll up to the top. Here is the weekly structure. So first is the general daily structure. This is in general what is done every day, which I like. And then detailed daily structure. So for Monday we have camp orientation, and this is actually what I was looking for. So some less performant models like Llama, on Monday they just jumped straight into the camp activities without any orientation or safety briefing or introductions, and that's just not correct. But here it was able to incorporate, you know, these introductory activities first before jumping into the camp activities, so extra points for that. And then afternoon we have sports rotation, swimming assessment; evening, welcome campfire, s'mores and storytelling.

All right, next one. So Tuesday morning we have art, photography, poetry, and then afternoon we have nature hike, outdoor sketching. Here is where I would say it performs a bit worse than GPT o1. So in GPT o1's response, it actually mentions that, well, after lunch the kids are going to be a bit more sleepy. They're not going to want to go out and do active things, so it's best to actually reserve these arts and crafts activities for the afternoon and then do these higher-action physical activities in the morning, which actually makes a lot of sense. So here in Claude 3.5's response, it was not able to incorporate this into its schedule, so I would say in that case o1 performs slightly better than this.

And then, all right, Wednesday. It seems like every day has a theme, so Wednesday is science and tech. We have STEM workshops; afternoon, we have some group science projects, tech treasure hunt. All right, Thursday is sports and well-being. Again, I don't really like the structure of this, how it's just grouping all sports into the same day. So in the morning we have sports tournament and then yoga meditation session, and then afternoon we have swimming, team sports, first aid workshops, and then in evening we have another wellness workshop and team challenges. I think that's just too much for one day. Like, evening we should just chill and relax, eat some s'mores or something. The morning should be reserved for high-action physical activities, and then the afternoon should be reserved for less physical activities like yoga, meditation, or first aid workshops. So again, not as impressive as o1, but still pretty good, definitely better than Llama.

And then Friday we have creative expression. All right, and we have this awards ceremony on Friday. Uh, shouldn't the awards ceremony be at the end? Anyways, Saturday is adventure. All right, and I like that for Saturday evening they have this group reflection, which is a nice addition. And then Sunday is farewell, and it's nice that, you know, this is reserved for packing up, cleaning up, group photos, and then they also have a closing ceremony at the end. So overall not bad. There's one extra point that GPT o1 had in here, which is after the camp was over, also reach out to the parents and the kids with a survey to see how you can improve the camp. And it seems like, at least for Claude 3.5 Sonnet's response, it does not have that suggestion. So again, not as good as o1's response, but still okay.

All right, let me start a new chat now, and here is another challenging question: Alice has N brothers, and she also has M sisters. How many sisters does Alice's brother have? So let's click Enter and see if it can solve this. And yes, the correct answer is M+1, so it nailed this as well.

Now, one additional thing to note is that this does not have online search, so you can't ask it to fetch you the most up-to-date information. So for example, if I ask it what's the price of Bitcoin on, let's say, October 22nd, 2024, let's see its response. So here you can see it says, I have a knowledge cutoff from April 2024, so I can't give you the most up-to-date information, basically.

So anyways, that sums up my very initial testing and review of Claude 3.5 Sonnet (new). Note that again, there is a lot of cool stuff you can do with Claude and their artifacts feature, so check out this video if you haven't already, where I go over how you can make an interactive infographic report, how you can create an audio visualizer, how you can pretty much clone the look of any website, and a lot more cool features. And this new version of Claude 3.5 Sonnet can also do all of that, so that's why I didn't really showcase much of it here in this video. I mostly just wanted to feed it some tricky prompts to see if it can get them correct. So that wraps up my video about this Claude 3.5 Sonnet (new), which I'm still confused why they didn't just call it 3.