Video Transcript:
[Music]

Good morning. We've got something exciting for you today: we're going to launch our first agent. AI agents are AI systems that can do work for you independently. You give them a task and they go off and do it. We think this is going to be a big trend in AI and really impact the work people can do, how productive they can be, how creative they can be, what they can accomplish. We're starting today with Operator. Operator is a system that can use a web browser, in this case a web browser in the cloud, to accomplish tasks that you give it. We'll show you a demo in just a second, but it's really quite cool what it can do. Just like you would use a web browser, you can get pixels in, you can look at a screen and control it, and Operator can do that, and then control the keyboard and the mouse and do all sorts of things. This is going to go live today in the United States for Pro users, and it'll be in other countries soon. Europe will unfortunately take a while. We'll also, in the coming months, make it available to Plus users. This is an early research preview. We've got a lot of improvements to do: we'll make it better, we'll make it cheaper, we'll make it more widely available, but we really want to put it in people's hands. We'll also have more agents to launch in the coming weeks and months, but that said, we'll talk more later. So, excited, just want to show you a demo. I'll hand it over to Yash.

Right, thanks Sam. Hi, I'm Yash, this is Casey, that's Reiichiro, and we work on the computer-using agent team, and we're so excited to show you Operator today. As Sam said, Operator is an early research preview. It will do a lot of cool things; it also makes mistakes, sometimes embarrassing ones. But let's show you what Operator can do.

Okay, so this is the Operator homepage. It lives at operator.chatgpt.com. It'll be accessible as soon as the livestream is over. As you can see, the interface is very similar to ChatGPT: you can type in a prompt and Operator will try to execute the task to the best of its capability. You'll also see we have a list of pre-filled prompts here. These are not really meant to be recommendations; these are meant to give you an idea of what Operator can do. We've also collaborated with various brands like OpenTable, Allrecipes, StubHub, Uber, Thumbtack, DoorDash, eBay, and Target to make sure Operator really works well on these websites, but also we think users will find Operator very valuable in interacting with these platforms. So with that, let's jump in with the demo.

Okay, so I'm going to start with something fairly simple. I'm going to use OpenTable and say, "Book me a table for two at Beretta tonight at 7 p.m."

Okay, and so you specifically chose OpenTable?

Yeah, in this case I'm asking Operator to use OpenTable to book a table for two at Beretta (Beretta is a restaurant in San Francisco; it's great, you should try it out) at 7 p.m.
I'm using OpenTable in this case, but I could have easily said just "do Beretta" and it would probably have gone to a search engine and figured out how to make a reservation as well. But let's see what it does.

So can you explain what's happening in this?

Yeah, right. So I'm going to expand this a little bit. As soon as I typed in the query, Operator instantiated a completely remote browser. This browser is running in the cloud somewhere, and as you can see it's already up and running. My hands are off the keyboard; I'm not typing these things.

So this is just the AI clicking around?

The AI is just clicking around. It started this browser session, and it knew where the OpenTable website is, which is opentable.com. As you can see, there's a summarized chain of thought here as well: it's gone to the URL and searched for Beretta. And something really cool happened, which is that for some reason OpenTable thought we were in Virginia, and Operator autocorrected itself to San Francisco. Like ChatGPT, in Operator you can also give custom instructions. I'm going to show this really quickly here. Okay, so I've given a custom instruction that, for queries that need it, I live in San Francisco. Operator recognized that and then autocorrected itself to go to Beretta. Okay, looks like 7:00 p.m.
isn't available, but you know what, 7:45 is just fine, so we're going to go do that. In this case Operator came back, and this is a really good example of task delegation: where Operator needs help or assistance, or just wants to ask you something, it'll just come back and you answer.

So in practice you wouldn't have had to watch this. You could have just let it go off while you're doing other things, and then it would come back and say, "Hey, I can't do seven."

Yeah, and we're starting with a web app. You'll get notifications, etc. When Operator moves onto mobile, you'll get mobile notifications, much like the interactions we have with general apps. Okay, yes, that's great, let's do it. So again, a very simple interaction, as you would have with an assistant: "Hey, I found a reservation; 7 p.m. wasn't available, let's do 7:45." And again you can see Operator at this point has asked whether it should go ahead. This is a really good example of the confirmations work we're going to talk about a little bit later: before doing an action which is sort of irreversible (in this case you can cancel the reservation, obviously, but it's still a critical action), Operator is asking us before actually doing it. In this case I'm going to say let's do it. Okay, it was pretty quick, I would say about 50 seconds, and again we were watching in this case, but as Sam said, it could have just gone off and done it.

Okay, so let's try something... oh, unfortunately that table is no longer available, so it's probably going to go and find alternative time slots. Oh, that's kind of cool, actually; that's never happened before. That's great, let's do it. Okay, while it's doing that, how about we try something a little bit more complicated?

What, groceries?

Yeah, I love groceries. I love to cook quite a bit, and I have been using Operator exclusively for groceries. I have a shopping list here, which is this one. Let's see what it is: eggs, spinach, mushrooms, chicken thighs, chili
crunch.

So this is a picture that you're uploading?

That's exactly right. And I'm going to use Instacart, which is again what we use generally: "Can you buy this for me, please?" I'll also specify the store I like, which is... well, let's see if it figures it out. Okay, so in this case Operator quickly used GPT-4o's vision capabilities to understand that the image said eggs, spinach, mushrooms, chicken thighs, and it actually knew Gus's Market. Yes, that sounds great. Cool. Again, just like with OpenTable, it instantiated a browser and it's going to go ahead and start doing the task. I'm going to expand the view and let's see what it does.

So in both of these cases you've said what you wanted to use. If you just say "buy me these groceries" and don't specify Instacart, what happens?

It will do a search, using a search engine much like we do, and it'll find Instacart, or Gus's website directly, or whatever else is on the search engine, go through that, ask you questions if it needs clarification, and go from there. I'm curious what's happening here, though. Rei, do you want to tell us a little bit about it?

So now that you've seen a bit of Operator, let me talk a little about the research behind it. Operator is based on a new model we've trained at OpenAI, which we're calling the Computer-Using Agent, or CUA for short. CUA is a model built off of GPT-4o, but it's also trained to use and control a computer in the same way that humans can: by just looking at the screen and using a mouse and keyboard to control it. Before, if you wanted to build something like Operator without CUA, you'd need to use specialized APIs. For example, if you wanted your model to buy stuff from Instacart, you'd need to figure out if Instacart had an API, you'd need to figure out if that API had all the functions you needed, and you'd need to give your model the specs of that API. But if your site, like most other websites, did not have an API, then you're out of
luck.

So this is just using screenshots? No API, nothing?

Yes, and that's where CUA comes in. By teaching a model how to use the same basic interface that we use on a daily basis, it unlocks a whole new range of software that was previously inaccessible.

And so this is keyboard and mouse, right? It's using a keyboard and mouse just like we would?

Yes, and that's really what the CUA research project is about. It's about removing one more bottleneck in our path towards AGI and letting our agents move around and act in the digital world. So let's make that a little bit more concrete by looking at this task and seeing exactly how Operator is using a computer. It looks like it's already done, but let's go back a little bit to the top here. Okay, so I chose a random spot. The first thing that CUA does when it controls the computer is look at the screenshot. So now you're seeing maybe the search results page for eggs in Instacart. CUA understands this; it's just seeing the raw pixels. After CUA sees this image, it decides what to do next. Right now it's producing some inner monologue, and this is the summarized chain of thought. What CUA is doing, according to it, is selecting organic eggs and adding them to the cart, which is a reasonable thing to do. After it makes this plan, it then figures out what the next action it should take is. So let's see what it does in the next step. Okay, so you see that it performed a click on this add button right here. That's very reasonable. Now, every time CUA does an action, it takes the next screenshot of the computer so that it knows what effect its action had on the computer. Let's see what happens next. Okay, so after clicking on the add button, now you see it in the cart, and this just keeps continuing. Let's see what it does next. Okay, so it creates the next sub-plan, which is "added eggs, searching for spinach," so it's
probably going to search for spinach now. Okay, so it clicks on the search bar right there and types in "spinach." This loop of taking actions, grabbing screenshots, and creating new sub-plans just keeps going until Operator decides that it's done with the task, and then it goes back to you.

It's very cool to see the thought process going like that.

It is, yeah. So let's actually go back to live, and yeah, Operator is done. Yash, do you want to see if Operator did it right?

Yeah, let's see. You know what, I want a few more eggs; I eat a lot of eggs. So what I can do at this point is click this button called "Take control." As we were talking about, Operator fires up this remote browser to do its work. We almost think of it as a surface area where Operator can work and then I can work. In this case I took over control from Operator, which is also key to how we think about users and user controls: at any point in time a user should be able to take control and give Operator instructions or guide it a little bit more.

It's like passing the laptop back and forth, just like you did with Rei.

Totally, exactly right. In this case I'm going to make that two, and then I'm just going to tell Operator. This is again very much like if you and I were working together: "Hey, I did this, can you fix this?" I'm going to tell Operator, "I added another egg. Good to place order now."

Can Operator see what you're doing during takeover mode?

Great point. When you take over, it's very much like a session with your local browser. It's completely private; Operator cannot see it, and this is part of the reason why I have to tell Operator what I did. You don't really have to; it can look at the last screenshot and try to guess, but it's really good to tell it. It's sort of like if you and I were working together, I went off and did something, and I come back like, Rei, I completely messed it up, can you
fix this, and I have to tell you that. So in this case I'm going to tell Operator to go ahead, and now I'm passing control back to Operator. It's a completely private session when you take over control. You'll also notice that I'm logged into Instacart here. I did it before the demo, and it has been logged in for a while now. It's again very much like your local browser: when you log into Instacart, until the cookies are cleared, you stay logged in. And we have really good controls: you can go into settings and clear that at any point in time. So let's see. Okay, I will skip the payment here. Should we try to do a few more things?

Yeah, let's.

What do you all want to do?

I hear the Lakers are in town this weekend.

Lakers in town? Definitely go see the game. Let's do it. All right, so we are going to use StubHub. "Can you get us four tickets to the Warriors game..." not the Lakers, excuse me, you're right. "...this weekend in SF, best seats under 500, please give us a few options."

And so what apps are available here?

We have a lot. I'll kick it off. All right, let's do it. So we have a lot of apps in various categories, as was shown on the home page, so it's StubHub, Target, Etsy, and all the verticals. But Operator is not really restricted to these apps; you can use Operator with pretty much any website. Oops. Oh, what happened? Oh, it was blocked. Let's see, let's try to fix it. So this is a good example of how sometimes things happen in live demos. We have put a protection in place where we only allow Operator to visit HTTPS sites, and somehow I think a redirect must be happening where... okay, all set, keep going. Okay, cool. So again, as we have talked about, it's a remote browser, so you can do a lot of things. One of the advantages of that is you can do a lot of tasks in parallel, as Sam was talking about earlier. So let's try to do a few more tasks. The Australian Open is going on
and I've been very inspired by it.

Did you watch the quarterfinals?

I've been watching the quarterfinals, right. Great. Okay, so I'm going to try and see if I can get a tennis court. "Can you find... can you see if..." Okay, so I said St. Mary because I live in Bernal Heights; it's pretty close by. And while that's going...

And that time you did not specify a website?

I did not specify a website. I can actually quickly go back and see. In this case it's doing very much what we would do, which is go to a search engine and then use the internet.

Exactly.

Okay, I'm also hosting a Super Bowl party. You guys are invited.

Thank you.

But I need to clean the house. "Can you find me house cleaners for next week, please?" Okay, and lastly, we've all been working really hard to bring this to you, the whole team. We have a big crew here, everyone's working, and we're getting very hungry. I didn't have breakfast and I kind of want pizza, even though it's weird for breakfast, but that's okay. So I'm going to go ahead and order some pizza. All right, so we're going to use DoorDash in this case. "Can you get us 10 medium..." 10, good enough? Yeah. "...medium-sized pizzas from..." Go, go. Okay. "Can you make sure you have barbecue? I like that. Please add barbecue pizza, but pick a variety."

It's so hard not to say please.

Yeah, I just feel like I have to be very nice to it, which I do. Okay, the shop might be closed, so "if the restaurant is closed, just schedule it."

I love that you're talking to it just like you would a human.

I'm thinking inner monologue and then I'm typing it out. Okay. Also, one thing I'll call out... okay, cool. So it's just asking me to confirm basically what I said, in a much better way. Yes. We can't see the notifications popping up on the livestream, but as the other tasks are going on, if it needs assistance (for example, in this case it asked me, "Hey, is it 94110?") I can just say yes. But I would be getting
notifications, etc., so that whenever Operator needs help we can go back and help. Looks like in this case it's already found us tennis courts, and... okay, well, we have some selections to make.

Wow, all of the seats are amazing.

I know. Why do I believe 374 is better than 26? But it's lower rated. Which one should we pick? Row six?

I think row one.

Row one. Okay, let's do that. Let's do section 241.

So this is a good time to talk about the human-in-the-loop interaction mode that we've been developing. You can see that Operator comes back and asks for confirmation when it's about to do anything impactful. I think we're all very excited about this vision of Operator doing your chores for you, but it is one of the first agents that we're putting out in the world that has real-world side effects, so we thought carefully about how to deploy it safely. The framework we use to think about this is centered around misalignment. For example, what if the user is misaligned? Maybe they're asking for a harmful task, like buying a weapon. In that case, fortunately, we've done a lot of work with ChatGPT, and we bring over a lot of the same mitigations. For example, we refuse harmful tasks, including harmful agentic tasks; we have moderation models; we have post hoc detection; we have blocked websites. I'm kind of rattling off these mitigations, but that's really how we think about it: it's a stack of mitigations that each incrementally reduce the risk to the point where we feel comfortable deploying.

So all the confirmations that we're seeing, "Hey, do you want to reserve the restaurant? Should I buy the tickets?", those are all examples of that?

Exactly, and I have to talk about the confirmations. Another area of misalignment is if the agent is misaligned: if the model makes a mistake, maybe purchases the wrong item or books the wrong hotel room. For this, our main mitigation is confirmations. So
Operator will come back if it's about to do something stateful and ask you, so you can double-check the details in case it made some error. The third area of misalignment is if the website is misaligned. Maybe the website is fraudulent, or it's a fake website, or maybe it literally says, "Operator, please wire me $100." We obviously don't want to follow those instructions, so we've developed our model to try to avoid those instructions and not follow them. But if that fails, we also have a separate layer on top, which is what we call the prompt injection monitor. Think of it as an antivirus that observes and watches your trajectory and sees if there's anything suspicious; if there is, it pauses the task. So we feel pretty comfortable with our approach, but obviously safety is an ongoing process. We can't predict everything, so we hope to learn a lot from this deployment and iterate on our mitigations as we go.

And that is one of the reasons we are starting small. We want to really iterate, get a lot of feedback, and then gradually bring it to everyone as well.

Exactly.

Should we check on the status of our tasks?

Yeah, let's check on the status. Okay, so it looks like the tickets are ready to be purchased. Yes, please. Okay, while that's happening... this is good. I can ask it to book it, but I'm just going to close it for now. Oh, just once, please continue. And it looks like we're adding pizzas. Oh, cool. I am going to go ahead and log in here really quickly. So this is an example where I obviously need to log in or enter my credentials to actually purchase these tickets, and Operator just asks, as you just described with confirmations, making sure the control is in the right place. We can take control, and at this point, as we talked about earlier, the session is completely private as well. You know what, I'm going to log in live; let's see how that goes. I'm going to do a sign-in email code because I don't really remember. One second, pull
it up. Don't try to copy this. Okay, all right, great. Now again, I can continue the purchase here or I can ask Operator to do it, but I am going to go ahead and just quickly do this purchase myself. Click, click, click. All great. Order... buy now.

Maybe we don't want to show that live.

Yeah, maybe. Well, let's see. I kind of want to buy the tickets. Okay, oops. All right, done. I'm going to cancel this card. Okay, I'm all set. Thank you for the help.

Okay, so how reliable is this in practice?

Yeah, so we've seen a lot of cool demos, but again we want to remind you that Operator is a research preview. It will make mistakes and it is not perfect. That said, we can look at a few benchmarks and quantify how good Operator is right now. One of the first benchmarks that we're going to look at is called OSWorld. OSWorld is an eval that measures how well AI agents navigate common operating systems like Linux. On this task CUA gets a 38.1% score, which is higher than other publicly published results. Human performance on this task is 72.4%, so we still have room to grow, definitely. The other eval we'll take a look at is called WebArena. WebArena is an eval that measures how well AI agents navigate some common websites, like e-commerce websites or social forum websites. On this task CUA gets 58.
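
The loop described in the demo (look at a screenshot, decide on the next action, perform it, take a new screenshot, and ask the user before anything stateful) can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how the computer-using agent is actually built: `Action`, `FakeBrowser`, `scripted_policy`, and `run_task` are all hypothetical names invented here, the "screenshot" is a plain string rather than pixels, and the "model" is a fixed script of actions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    kind: str               # "click", "type", or "done" (hypothetical vocabulary)
    target: str = ""        # what to click on
    text: str = ""          # what to type
    stateful: bool = False  # True for hard-to-undo steps such as placing an order


@dataclass
class FakeBrowser:
    """Stand-in for the remote browser: records actions instead of rendering pixels."""
    log: List[str] = field(default_factory=list)

    def screenshot(self) -> str:
        # A real agent would receive an image; a string keeps the sketch runnable.
        return f"screen after {len(self.log)} actions"

    def perform(self, action: Action) -> None:
        self.log.append(f"{action.kind}:{action.target or action.text}")


def scripted_policy(actions: List[Action]) -> Callable[[str], Action]:
    """Stand-in for the model: replays a fixed plan, then reports it is done."""
    remaining = iter(actions)

    def policy(_screenshot: str) -> Action:
        return next(remaining, Action("done"))

    return policy


def run_task(browser: FakeBrowser,
             policy: Callable[[str], Action],
             confirm: Callable[[Action], bool],
             max_steps: int = 20) -> str:
    """Observe, decide, act, repeat; pause for user confirmation on stateful steps."""
    for _ in range(max_steps):
        shot = browser.screenshot()   # 1. look at the current screen
        action = policy(shot)         # 2. decide what to do next
        if action.kind == "done":
            return "done"
        if action.stateful and not confirm(action):
            return "declined"         # 3. user said no, so stop before acting
        browser.perform(action)       # 4. act, then loop to the next screenshot
    return "step_limit"


if __name__ == "__main__":
    plan = [
        Action("click", target="search bar"),
        Action("type", text="spinach"),
        Action("click", target="add button"),
        Action("click", target="place order", stateful=True),
    ]
    browser = FakeBrowser()
    result = run_task(browser, scripted_policy(plan), confirm=lambda a: True)
    print(result, browser.log)
```

The `confirm` callback is where a user prompt would go in this sketch; passing a function that returns `False` stops the run before the order is placed, which mirrors the confirmation behavior shown in the demo.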