Today Google has introduced Gemini 2.0, their new AI model for the agentic era. In this video we'll dive into everything Google has announced and how Gemini 2.0 is going to change how we interact with AI. Gemini 2.0 enables new kinds of multimodal AI agents: agents that can see and hear the world around you, think, plan, remember, and take action. Gemini 2.0 is what enables Project Astra, our research prototype of a universal AI assistant. Using capabilities like multimodal memory and real-time information, it lets you understand your world live. "What can you tell me about the sculpture?" "The sculpture you're seeing is called My World and Your World by Eva Rothschild, located in Lewis Cubitt Park in London." "What kind of themes does her work explore?" "She creates abstract sculptures that invite viewers to interact with and interpret their surroundings in new ways." Project Astra is multilingual, using native audio to seamlessly switch languages as you speak. We're taking this to the next step with Project Mariner, with agents that can complete tasks on your behalf. You can ask it to do things like this that require multiple steps: doing research to look up the artist, then finding the painting, and then shopping for the right supplies. The agent uses the web to take action and find what you're looking for. It plans and reasons at each step, always staying in your control, and completes your task. Gemini 2.0 enables agents that can be applied across domains. We're exploring agents that can explore virtual worlds, assisting you in video games. "Where do you recommend I attack from on this base?" "Given the base layout, I recommend attacking from the bottom or south side. This direction allows you to target the Town Hall directly with your Giants while the Wizards can handle the surrounding defenses."
Gemini 2.0 can reason about the physical world, understanding 3D spatial environments and objects. These capabilities are being applied to robotics, with agents that can assist you in your everyday physical environment. Learn more about how Gemini 2.0 is enabling the next generation of multimodal AI agents at deepmind.google. Next, let's take a look at a real-world use case of how individuals can use Google's Gemini 2.0 baked into Project Astra, which enables agentic capabilities such as tools, memory, and multimodal reasoning. All right, we've got a Pixel phone loaded with the latest test build of Project Astra, and we're going to leave the office to test it around London. "I've got this email with apartment info. Can you tell me the door code and remember it?" "I'll remember the door code for you. It is 1170. To unlock it, enter the code and then press the unlock button." "Based on these symbols, how should I wash this?" "The tag shows to machine wash at 30° Celsius, avoid bleach and tumble drying, iron on low, and dry clean if needed." "Then what setting should I use on this machine?" "Based on the laundry instructions of 30°C, select the 30 setting under Easy Care on the dial." "I've got this list of recommendations. Can you tell me a bit about some of them?" "I can search for these places for you. Coal Drops Yard is a shopping destination with unique industrial design, Luminary Bakery is a social enterprise bakery, and Tate Modern is a modern and contemporary art museum." Now we're going to try Project Astra out on the street; let's see what it can do. "I've always said scone, but now I'm hearing people say scone. Can you help me with the pronunciation?" "It's a fascinating topic. Both scone rhyming with gone and scone rhyming with bone are accepted pronunciations, though regional variations exist." "What can you tell me about the sculpture?" "The sculpture you're seeing is called My World and Your World by Eva Rothschild, located in Lewis Cubitt Park in London." "What kind of themes does her work explore?" "She creates abstract sculptures that invite viewers to interact with and interpret their surroundings in new ways." "Would these do well if I planted them back home?" "Mountain fleece can grow well in New York. They prefer full sun to partial shade and moist, well-drained soil, so your garden in New York should be a good fit for these flowers." "I'm trying to get a gift for my friend, and this is everything she's been reading. How would you describe her taste in books?" "Based on these titles, your friend seems to enjoy a mix of history, biography, and perhaps a touch of political or social commentary." "I picked some of these out. Any idea which one she'd like the most?" "Consider Half of a Yellow Sun for its exploration of complex social and political issues within a historical setting, aligning with her interest in narratives that offer insight into significant periods." "Will that bus take me anywhere near Chinatown?" "Yes, it does. The 24 bus route goes through Leicester Square, which is very close to Chinatown." "Are there any landmarks on the way?" "Riding the number 24 bus, notable landmarks you will likely encounter are Westminster Abbey, Big Ben, and Trafalgar Square." I also met up with some friends to test out the multilingual capabilities. Now I'm using these prototype glasses to test Project Astra hands-free. "Can you check if it's going to rain anytime soon?" "It looks like it is currently mostly sunny in London. Rain is expected tonight from around 10 p.m. until tomorrow morning at 6 a.m."
"Hey, what's that park next to me?" "The park looks to be Primrose Hill, known for its panoramic city views and the notable Shakespeare's Tree." "Can you check if I can bike in there?" "It appears that cycling is actually banned within Primrose Hill itself, although it is permitted in nearby areas and the rest of Regent's Park." "I'm biking back to Camden. Can you check if there are any supermarkets along the way?" "There's a Sainsbury's on Camden Road, a Morrisons on Chalk Farm Road, and an M&S Simply Food on Camden High Street." "What was the door code again?" "The door code you saved previously is 1170." "Oh thanks, that works." Next, let's take a look at Google's Project Mariner. This is how you can actually take action in your Chrome browser with an AI agent. Today I want to tell you about Project Mariner. It's a research prototype exploring the future of human-agent interaction, and it's built on Gemini 2.0. As with all new technology, it's important for us to build this responsibly, which is why we're starting small: we'll be getting feedback from a group of trusted testers and using their experiences to shape how Project Mariner evolves. Let me show you how it works. Project Mariner works in the browser as an experimental Chrome extension. I'm going to start by entering a prompt. Here I have a list of outdoor companies in Google Sheets, and I want to find their contact information, so I'll ask the agent to take this list of companies, find their websites, and look up a contact email I can use to reach them. This is a simplified example of a tedious multi-step task that someone could encounter at work. Now the agent has read the Google Sheet and knows the company names. It starts by searching Google for Benchmark Climbing, and now it's going to click into the website. You can see how this research prototype only works in your active tab; it doesn't work in the background. Once it finds the email address, it remembers it and moves on to the next company. At any point in this process you can stop the agent or hit pause. What's cool is that you can actually see the agent's reasoning in the user interface, so you can better understand what it is doing. It will do the same thing for the next two companies, navigating your browser, clicking links, scrolling, and recording information as it goes. You're seeing an early-stage research prototype, so we sped this up for demo purposes. We're working with trusted testers to make it faster and smoother, and it's so important to keep a human in the loop. After the fourth website, the agent has completed its task, listing out the email addresses for me to use. And there you have it: we're really just scratching the surface of what's possible when you bring agentic AI to computers, and we're really excited to see where this goes. Next, let's look at how you can use Project Mariner to do different things, such as online shopping and interacting with the web more efficiently. Today I'm going to walk you through a demo of Project Mariner, a research prototype we built with our new Gemini 2.0 models. Let me show you how it works. I'm going to start in the Gemini app, where Project Mariner is an experimental Chrome extension that works across all web pages. In this demo I'll be prompting the agent to find the most famous post-impressionist, find a colorful painting of theirs on Google Arts & Culture, then add some colorful paints to my Etsy cart. In Gemini, the agent starts by asking who the most famous post-impressionist is. Gemini quickly identifies Van Gogh as the most famous post-impressionist painter. From here it navigates to Google Arts & Culture by typing the URL into the address bar. You can see how this research prototype only works in your active tab; it doesn't work in the background. It opens the search bar to look for Van Gogh. Van Gogh is the first option, so it clicks on it. Now it decides to click on the artist page to look for a colorful painting. We're able to see the agent's reasoning in the user interface throughout the process, so we can easily follow along and understand what it's planning. The agent realizes that the page only contains information about Van Gogh, so it decides to scroll down to look for artwork. It discovers a collection of art and starts looking through it to find a colorful option. The agent used its multimodal understanding capabilities to find a colorful painting of irises, and it clicks on it. After finding this painting, it navigates to Etsy by typing in the URL. It clicks on the search bar and searches for colorful paints. Now it's looking at different paint sets, and the agent gives us a sense of what it's optimizing for; as you can see, it's optimizing for price and visual appeal. The agent clicks on a set of watercolors that appears to be a good fit, and it's then added to the cart. After a moment, the agent asks me if it should proceed to checkout. This is an early research prototype, but it's a great example of how we are keeping a human in the loop and ensuring you're always in control. I'll just respond by saying there's no need to check out, and now it completes its task and hands back control. Project Mariner is now available to a select group of trusted testers as an experimental Chrome extension. They're giving us feedback to help us build this safely and responsibly. We are excited to see where this technology goes and what people create with it.
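The Mariner demos above follow a simple pattern: read the task, then loop over observe → reason → act in the active tab, surfacing the agent's reasoning and keeping the human able to stop it at any point. Here is a minimal sketch of that loop in plain Python; nothing in it is Mariner's real API, the browsing step is a stub, and every name is an invented illustration:

```python
# Hypothetical sketch of an observe-reason-act agent loop in the style
# of the Mariner demo. The "browsing" is a stubbed lookup function.

def run_agent(task, companies, lookup_email, max_steps=20):
    """For each company, 'browse' to find a contact email,
    recording the agent's reasoning at every step."""
    results, log = {}, []
    for step, company in enumerate(companies):
        if step >= max_steps:                 # keep the agent bounded
            break
        log.append(f"Searching the web for {company}")       # act
        email = lookup_email(company)                        # stubbed browsing
        log.append(f"Found contact for {company}: {email}")  # reasoning shown to user
        results[company] = email
    return results, log

# Usage: a fake "web" stands in for real browsing.
fake_web = {"Benchmark Climbing": "info@benchmarkclimbing.example"}
emails, trace = run_agent("find contact emails",
                          ["Benchmark Climbing"],
                          lambda c: fake_web.get(c, "not found"))
```

The point of the sketch is the shape, not the details: one bounded loop, one visible log line per decision, and results the human can inspect before anything irreversible happens.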
Next up is Gemini 2.0 Flash. Gemini 2.0 Flash builds on the success of 1.5 Flash, the most popular model for developers. This is absolutely incredible, because this model is fast: notably, they state that 2.0 Flash even outperforms 1.5 Pro on key benchmarks, and at twice the speed. 2.0 Flash comes with new capabilities in addition to supporting multimodal inputs like images, video, and audio.
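To make the multimodal-input and tool-use capabilities concrete, here is a hedged sketch of the JSON body a `generateContent`-style request might carry: a text part, an inline image part, and Google Search enabled as a tool. The field names follow the public Gemini REST API as I understand it; treat the exact shape as an assumption and verify against the current API reference before relying on it:

```python
import base64

def build_request(prompt: str, image_bytes: bytes) -> dict:
    """Assemble a generateContent-style request body. The field names
    ("contents", "parts", "inlineData", "google_search") are assumptions
    modeled on the public Gemini REST API, not verified here."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"text": prompt},
                {"inlineData": {                      # multimodal image input
                    "mimeType": "image/png",
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ],
        }],
        "tools": [{"google_search": {}}],             # native tool use
    }

body = build_request("What is in this photo?", b"\x89PNG...")
```

The body would then be POSTed to the model endpoint with your API key; the sketch stops before the network call so it stays self-contained.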
Gemini 2.0 Flash now supports multimodal output, like natively generated images mixed with text, and steerable text-to-speech multilingual audio. It can also natively call tools like Google Search and code execution, as well as third-party user-defined functions. Google's goal is to get these models into people's hands safely and quickly, and over the last few months they've been sharing early experimental versions of Gemini 2.0, getting incredible feedback from developers. Gemini 2.0 Flash is now available as an experimental model, and I'm going to show you exactly what this model is capable of. One of the first things I want to show you is Gemini's ability to do live streaming; take a look at this demo. Okay, this is a quick demo of multimodal live streaming in Gemini 2.0.
We are casting a tab into AI Studio. "Can you see my screen?" "Yes, I see your screen. It's showing a document with the title Demo. It also contains bullet points and a Google Meet window with a person in the video. Is there anything in particular you would like to talk about on the screen?" "Yeah, can you start the demo by just reading the text I highlighted?" "Sure, I can do that. The text you highlighted reads: the Multimodal Live API lets you build real-time multimodal applications powered by Gemini 2.0 Flash. It features tool use like Google Search and code execution, function calling, and more. Would you like me to read the following lines too?" "No, can you actually just tell me what this word means, though?" "I can do that. Multimodal refers to the ability to process and understand different types of data, like text, images, and audio, all together. Does that help?" "Yeah. I need to demo the interruptions feature next. Can you tell me a really boring, kind of once-upon-a-time sort of story?" "Okay, I can do that. Once upon a time, in a small village nestled beside a quiet forest, there was a very unremarkable rock. This rock was gray, and it was round, and it sat there..." "I'm going to interrupt you." "No problem. I understand that interruptions are part of the demo. What should I do now?" "So we're going to demo memory. Can you just summarize everything we've seen or heard so far?" "I can do that. So far we've discussed a demo of the Multimodal Live API, which lets you build real-time multimodal apps, and I defined what multimodal means. I was also asked to tell a very boring once-upon-a-time story, but I was interrupted. Is there anything else?" "No, I think we're done. Do you want to read this ending card, or should I?" "I can read it if you would like. The ending card says: start building with Gemini 2.0 at aistudio.google.com. Is that correct?" "Yeah, that was great. All right, thanks. Talk to you later." Next, let's take a look at how you can use Gemini to natively output images. It's incredible what you're able to do with Gemini's new model, so take a look. These are examples of native image output in Gemini 2.0.
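Native image output arrives as one response whose parts interleave text with inline image data, so client code typically walks the parts and handles each kind. A small sketch of that walk over mocked data; the `{"text": ...}` / `{"inlineData": ...}` part shapes are assumptions modeled on the Gemini REST response format, not verified here:

```python
import base64

def split_parts(parts):
    """Separate a list of response parts into text chunks and decoded
    image blobs. Part shapes are assumptions modeled on the Gemini
    REST API response format."""
    texts, images = [], []
    for part in parts:
        if "text" in part:
            texts.append(part["text"])
        elif "inlineData" in part:            # natively generated image
            images.append(base64.b64decode(part["inlineData"]["data"]))
    return texts, images

# Usage with a mocked interleaved response.
parts = [
    {"text": "Here is the car in a summery yellow:"},
    {"inlineData": {"mimeType": "image/png",
                    "data": base64.b64encode(b"fake-png").decode()}},
]
texts, images = split_parts(parts)
```

In a real client you would write each decoded blob to an image file and render the text around it, preserving the interleaved order the model produced.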
Gemini 2.0 can now natively generate images as part of your conversation. Let's say you want to imagine how it might look to turn this car into a convertible. With today's AI tools, this takes a lot of work: you might have to manually mask out the roof area or write a really complex prompt. We brought this image into AI Studio and sent it to the new experimental Gemini 2.0 Flash model. This is a real-time screen capture, and all the prompts you'll see in this video are things that early testers can now try. This is the entire prompt we gave it: turn this car into a convertible. And it generated this image. Here they are side by side. Notice how it successfully modified the car but also managed to keep the rest of the image consistent. This is difficult in many AI tools that use separate models, but here it's all being done by one model. We continued the conversation, saying: imagine the car full of beach stuff and change the color to something that feels like summer; explain as you go. The model began outputting text. It explained its idea for a new color, then showed it to us. But the really neat thing is that the model kept going: it went right on generating another image with the car full of beach gear. Remember that this is all a single response of text and image tokens coming from the model. This ability to output across modalities, interleaving text and image together, is one of the most exciting aspects of Gemini 2.0. Here are some more example prompts and outputs, all coming from Gemini 2.0 Flash. Let's say you want to make this photo look a little nicer by getting rid of all this stuff on the couch: just give it this prompt, and the model does that for you. You can even combine images: you can ask what your cat might look like on this pillow, or on this skateboard. All of this enables you to co-create with AI in new ways. We gave the model this image, with this text on the side of the box, and we said: open the box; generate an image looking down into the contents of the box. We sent that prompt to Gemini 2.0 Flash. This approach, where you send parts of your prompt in the image itself, opens up new possibilities for a more seamless multimodal back-and-forth with the model. The model reasoned about the text on the side and generated this image of old electronics. You can push this even further, co-creating imaginary worlds together. We went back to the first conversation in the video. We wanted to try communicating visually with the model, so we drew a circle on the door handle with the prompt: open this. The model successfully figured out what we meant, and it generated this image with the car door open. And for the last prompt we said: make the car fly; imagine you are the car soaring through the clouds to the beach; show what that might look like. This is a challenge to see if the model can keep the car consistent while also visualizing this new scene, and the model was able to do it, following my instructions and generating this image and text. Now let's take a look at how you're able to use Google Gemini's model to get spatial understanding. It's pretty crazy what this model is able to do, so I'm going to let this demo show you exactly what's going on. These are examples of what you can build with spatial understanding in Gemini 2.0.
We introduced this capability in our 1.5 models, and we've advanced it even further with Gemini 2.0. This is a new tool in AI Studio that makes it easier to explore spatial understanding with multimodal reasoning. For example, you can input this image and prompt it to give you the positions of the origami animals. This is a real-time recording; notice how fast the results came back. That's because this is running on our new experimental Gemini 2.0 Flash model, which enables advanced spatial understanding with low latency. You can see if the model can reason about which shadow belongs to which animal by asking for the fox's shadow, and the model finds it over here. Or ask it about the armadillo's shadow; it finds that too. Spatial understanding enables new ways to interact with images. It lets models do more than generate text about an image.
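Under the hood, these demos rely on the model returning a bounding box for each object it finds. Gemini's documented convention is a `box_2d` of `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 scale, so client code rescales to pixel coordinates before drawing. A small sketch of that conversion (the convention is as documented for Gemini at the time of writing; double-check against the current docs):

```python
def box_to_pixels(box_2d, width, height):
    """Convert a Gemini-style [ymin, xmin, ymax, xmax] box on the
    normalized 0-1000 scale into (left, top, right, bottom) pixel
    coordinates for an image of the given width and height."""
    ymin, xmin, ymax, xmax = box_2d
    return (
        int(xmin / 1000 * width),   # left
        int(ymin / 1000 * height),  # top
        int(xmax / 1000 * width),   # right
        int(ymax / 1000 * height),  # bottom
    )

# Usage: a box covering the middle of a 640x480 image.
left, top, right, bottom = box_to_pixels([250, 250, 750, 750], 640, 480)
```

Note the y-before-x ordering in `box_2d`; swapping axes is the classic bug when overlaying these boxes on an image.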
Gemini 2.0 can search within an image. You can give it this image and see if it can find both rainbow socks; the model finds the matching pair. You can even ask it to find the socks with the face, and it finds those too. This was a particularly neat result, because the faces are really small and obscured. Like all models, it won't always get everything right, so you can try your own prompts to see what works for you. You can combine spatial reasoning with multilingual capabilities: give it an image like this and prompt it to label each item with Japanese characters and an English translation. The model reads the text from the image itself and translates it. With spatial understanding, Gemini 2.0 enables AI agents that can reason about the physical world. For example, you can give the model this photo and ask for the position of the spill, but then ask how it would clean it up, with an explanation, and the model points out the towel over here. And with Gemini 2.0 we're introducing 3D spatial understanding. This is a preliminary capability, still in early stages, so it won't be as accurate as 2D positions, but we're sharing it so that developers can try it and give us feedback. Here's a Colab notebook that lets you prompt the model to give you 3D positions within photos; then we visualize those positions in a top-down view, essentially turning the photo into an interactive floor plan. Now, the next one is genuinely going to surprise you. Google has done an absolutely amazing job of enabling the Gemini Flash model to natively understand and produce lifelike audio. It's pretty incredible, so take a look at this demo. Okay, let's talk about native audio.
Gemini 2.0 introduces multilingual native audio output. But maybe you're thinking: what exactly is native audio? You're actually hearing it right now. Everything you hear in this video was generated with prompts like, you know, this actual prompt on your screen, like right now. It was all generated by prompting the new experimental Gemini 2.0 Flash model, like you see in this AI Studio screen cap. Neat, right? Totally, native audio is really, really neat. It's different from traditional TTS, or text-to-speech, systems. What's super cool with native audio is that you can do more than just prompt an AI agent on what to say; you can tell it how to say it. You can prompt it to just be like, dude, you know, just totally chill, or prompt it to speak with oh so very many dramatic pauses. And all of this is multilingual. You know how when a computer switches languages it sounds like a different voice? That's a limitation of traditional TTS, but with native audio in Gemini 2.0 you can build agents that switch languages more seamlessly. Check this out: okay, so I'm starting out speaking English, but then... and now I'm back to English. Cool, right? With native audio, maybe information retrieval could be more expressive. Like, what if AI agents could tell you the weather differently on sunny days? Maybe they'd sound like this: ooh, the weather today is 74° and sunny all day, awesome! But if it's rainy, it might sound more like this: so, the weather today is pretty meh, drizzly and cold all day, oh well. Or what if an AI agent responded dynamically to your context? Maybe if it seems like you're in a rush, your agent can speak really, really, really, really quickly. Or maybe if you're whispering, you might want your agent to whisper back to you. Anyways, you get it: so many possibilities with native audio. New output modalities are available to early testers now, with a wider rollout expected next year. Start building with Gemini 2.0 at aistudio.google.com. Okay, okay, wait, I know the video is supposed to be over, but let me just show you a few more prompts. "We could do that last line more like: start building with Gemini 2.0!" "Okay: start building with Gemini 2.0." "Um, I guess: start building with Gemini 2.0? Yay!" Next, let's take a look at how you can start building with Gemini's tool use in AI Studio. These are examples of what you can build with native tool use in Gemini 2.0.
Gemini 2.0 is natively built to use tools like code execution and Google Search. Here's a demo that combines tool use with real-time interaction, built using the new Multimodal Live API. Make a bar graph that compares the runtimes of The Godfather and Oppenheimer. Add the other two Godfathers to the graph. Pick three random superhero movies and add them to the graph. Notice how quickly the model responded; that's because this is powered by our new experimental Gemini 2.0 Flash model. It's able to search and code while you interact in real time. One really neat thing about this demo is how easy it is to set up the graph renderer: all you have to do is describe what that tool does, like this, and the model, using the simple description you give it, figures out how to render graphs all by itself. We've open-sourced this demo, along with Colab notebooks to get you started. You can also explore tool use in AI Studio. Let's say you have to research New York restaurants, and you need to look up the phone number and address and write a brief description for each of these places. We gave this prompt to the 2.0 Flash model.
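The "just describe what the tool does" setup from the graph demo is essentially a function declaration: you hand the model a name, a description, and a parameter schema, and the model decides when to call it, returning a structured call for your code to execute. A hedged sketch of that pattern; the declaration style mirrors the Gemini function-calling docs as I understand them, and `render_graph` itself is an invented example tool, not the demo's real one:

```python
# Hypothetical function declaration for a graph-rendering tool, in the
# OpenAPI-subset schema style used by Gemini function calling.

render_graph_decl = {
    "name": "render_graph",
    "description": "Render a bar graph from labeled numeric values.",
    "parameters": {
        "type": "object",
        "properties": {
            "labels": {"type": "array", "items": {"type": "string"}},
            "values": {"type": "array", "items": {"type": "number"}},
        },
        "required": ["labels", "values"],
    },
}

def dispatch(tool_call, tools):
    """Route a model-issued tool call to the matching local function."""
    return tools[tool_call["name"]](**tool_call["args"])

# Usage with a stub implementation standing in for the real renderer.
tools = {"render_graph": lambda labels, values: dict(zip(labels, values))}
result = dispatch(
    {"name": "render_graph",
     "args": {"labels": ["The Godfather", "Oppenheimer"],
              "values": [175, 180]}},
    tools,
)
```

The declaration goes into the request's tool list; when the model emits a function call instead of text, your client runs `dispatch` and feeds the result back into the conversation.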
Look how quickly it does that for you. The model is writing search queries, retrieving info, and putting it into this table. It even provides citations and links so you can explore further. You can customize and control how tools are used in your applications. You might want to use Google Search for some things and not others. You can write instructions like these, telling it to only use Google Search for questions about sports. Then, if you ask something like this, the model will answer without using Search; but if you ask this, it'll use Google Search to answer, just like you instructed it to. This ability, where the model decides when to use the right tool based on your instructions, is one of the strengths of native tool use in Gemini 2.0. These are just a few examples of how you can build your own multimodal AI agents with Gemini 2.0. Next, what we have is something rather fascinating: meet Jules, your AI-powered code agent. Imagine your team has just finished a bug bash, and now you're staring down a long list of bugs. Starting today, you can offload Python and JavaScript tasks to Jules, an experimental AI-powered code agent that uses Gemini 2.0.
Working asynchronously and integrated with your GitHub workflow, Jules handles bugs and other time-consuming tasks while you focus on what you actually want to build. Jules creates comprehensive multi-step plans to address issues, efficiently modifies multiple files, and even prepares pull requests to land fixes directly back into your GitHub repo. Next, what we have here is a full-length demo of oia, using an AI agent built with Gemini 2.