Today, Google has introduced Gemini 2.0, its new AI model for the agentic era. In this video, we'll dive into everything Google has announced and how Gemini 2.0 is going to change how we interact with AI. Gemini 2.0 enables new kinds of multimodal AI agents—agents that can see and hear the world around you, think, plan, remember, and take action.
Gemini 2.0 is what enables Project Astra, our research prototype of a universal AI assistant. Using capabilities like multimodal memory and real-time information, it lets you understand your world live.
"What can you tell me about the sculpture? " "The sculpture you're seeing is called ‘My World and Your World’ by Eva Rothchild, located in Lewis Cubit Park in London. " "What kind of themes does her work explore?
" "She creates abstract sculptures that invite viewers to interact with and interpret their surroundings in new ways. " Project Astra is multilingual, using native audio to seamlessly switch languages as you speak. We're taking this to the next step with Project Mariner, with agents that can complete tasks on your behalf.
You can ask it to do things like this that require multiple steps: doing research to look up the artist, then finding the painting, and then shopping for the right supplies. The agent uses the web to take action and find what you're looking for. It plans and reasons at each step, always keeping you in control as it completes your task.
Gemini 2.0 enables agents that can be applied across domains. We're exploring agents that can navigate virtual worlds, assisting you in video games.
"Where do you recommend I attack from on this base? " "Given the base layout, I recommend attacking from the bottom or south side. This direction allows you to target the town hall directly with your Giants, while the Wizards can handle the surrounding defenses.
" Gemini 2. 0 can reason about the physical world, understanding 3D spatial environments and objects. These capabilities are being applied to robotics, with agents that can assist you in your everyday physical environment.
Learn more about how Gemini 2.0 is enabling the next generation of multimodal AI agents on the Google DeepMind Gemini page. Next, let's take a look at a real-world use case of how individuals can use Google's Gemini 2.0, baked into Project Astra, which enables agentic capabilities such as tools, memory, and multimodal reasoning.
"All right, we've got a Pixel phone loaded with the latest test build of Project Astra, and we're going to leave the office to test it around London. I've got this email with apartment info; can you tell me the door code and remember it?" "I'll remember the door code for you. It is 1170. To unlock it, enter the code and then press the unlock button."
"Based on these symbols, how should I wash this?" "The tag says to machine wash at 30° Celsius, avoid bleach and tumble drying, iron on low; dry cleaning is allowed." "Then what setting should I use on this machine?" "Based on the laundry instructions of 30° C, select the 30 setting under Easy Care on the dial."
"I've got this list of recommendations; can you tell me a bit about some of them?" "I can search for these places for you. Coal Drops Yard is a shopping destination with unique industrial design, Luminary Bakery is a social enterprise bakery, and Tate Modern is a modern and contemporary art museum." Now we're going to try Project Astra out on the street. Let's see what it can do.
"I've always said scone, but now I'm hearing people say scone. Can you help me with the pronunciation? " "It's a fascinating topic.
The pronunciation of scone—both ‘scone’ rhyming with ‘gone’ and ‘scone’ rhyming with ‘bone’—are accepted pronunciations, though regional variations exist. " "What can you tell me about the sculpture? " "The sculpture you're seeing is called ‘My World and Your World’ by Eva Rothchild, located in Lewis Cubit Park in London.
" "What kind of themes does her work explore? " "She creates abstract sculptures that invite viewers to interact with and interpret their surroundings in new ways. " "Uh, would these do well if I planted them back home?
" "Mountain fleece can grow well in New York; they prefer full sun to partial shade and moist, well-drained soil, so your garden in New York should be a good fit for these flowers. " "I'm trying to get a gift for my friend, and this is everything she's been reading. How would you describe her taste in books?
" "Based on these titles, your friend seems to enjoy a mix of history, biography, and perhaps a touch of political or social commentary. I picked some of these out; any idea which one she'd like the most? " "Consider ‘Half of a Yellow Sun’ for its exploration of complex social and political issues within a historical setting, aligning with her interest in narratives that offer insight into significant periods.
" "Will that best take me anywhere near Chinatown? " "Yes, it does. The 24 bus route goes through Leicester Square, which is very close to Chinatown.
" "Are there any landmarks on the way? " "Riding the number 24 bus, notable landmarks you will likely encounter are Westminster Abbey, Big Ben, and Trafalgar Square. " I also met up with some friends to test out the multilingual capabilities now.
I'm using these prototype glasses to test Project Astra hands-free. "Can you check if it's going to rain anytime soon? " "It looks like it is currently mostly sunny in London.
Rain is expected tonight from around 10 p.m. until tomorrow morning at 6 a.m." "Hey, what's that park next to me?" "The park looks to be Primrose Hill, known for its panoramic city views and the notable Shakespeare's Tree."
"Can you check if I can bike in there?" "It appears that cycling is actually banned within Primrose Hill itself, although it is permitted in nearby areas and the rest of Regent's Park." "I'm biking back to Camden; can you check if there are any supermarkets?" "Along the way, there's a Sainsbury's on Camden Road, a Morrisons on Chalk Farm Road, and an M&S Simply Food on Camden High Street." "What was the door code again?" "The door code you saved previously is 1170."
"Oh, thanks, that works!" Next, let's take a look at Google's Project Mariner. This is how you can actually take action in your Chrome browser with an AI agent today.
I want to tell you about Project Mariner; it's a research prototype exploring the future of human-agent interaction, and it's built on Gemini 2.0. As with all new technology, it's important for us to build this responsibly, which is why we're starting small.
We'll be getting feedback from a group of trusted testers and using their experiences to really shape how Project Mariner evolves. Let me show you how it works. So, Project Mariner works in the browser as an experimental Chrome extension.
I'm going to start by entering a prompt here: I have a list of outdoor companies in a Google Sheet, and I want to find their contact information. So, I'll ask the agent to take this list of companies, then find their websites and look up a contact email I can use to reach them. This is a simplified example of a tedious multi-step task that someone could encounter at work.
Now, the agent has read the Google Sheet and knows the company names. It then starts by searching Google for Benchmark Climbing, and now it's going to click into the website. You can see how this research prototype only works in your active tab; it doesn't work in the background.
Once it finds the email address, it remembers it and moves on to the next company. At any point in this process, you can stop the agent or hit pause. What's cool is that you can actually see the agent's reasoning in the user interface, so that you can better understand what it is doing, and it will do the same thing for the next two companies, navigating your browser, clicking links, scrolling, and recording information as it goes.
You're seeing an early-stage research prototype, so we sped this up for demo purposes. We're working with trusted testers to make it faster and smoother, and it's so important to keep a human in the loop. After the fourth website, the agent has completed its task, listing out the email addresses for me to use.
And there you have it! We're really just scratching the surface of what's possible when you bring agentic AI to computers, and we're really excited to see where this goes. Next, let's actually take a look at how you can use Project Mariner to do different things, such as online shopping and interacting with the web more efficiently.
Today, I'm going to walk you through a demo of Project Mariner, a research prototype we built with our new Gemini 2.0 models. Let me show you how it works.
I'm going to start in the Gemini app, where Project Mariner is available as an experimental Chrome extension that works across all web pages. In this demo, I'll be prompting the agent to find the most famous post-impressionist, find a colorful painting of theirs on Google Arts and Culture, then add some colorful paints to my Etsy cart. In Gemini, the agent starts by asking who the most famous post-impressionist is.
Gemini quickly identifies Van Gogh as the most famous post-impressionist painter. From here, it navigates to Google Arts and Culture by typing the URL into the address bar. You can see how this research prototype only works in your active tab; it doesn't work in the background.
It opens the search bar to look for Van Gogh. Van Gogh is the first option, so it clicks on it. Now, it decides to click on the artist's page to look for a colorful painting.
We're able to see the agent's reasoning in the user interface throughout the process, so we can easily follow along to understand what it's planning. The agent realizes that the page only contains information about Van Gogh, so it decides to scroll down to look for artwork. It discovers a collection of art and starts looking at them to find a colorful option.
The agent uses its multimodal understanding capabilities to find a colorful painting of irises and clicks on it. After finding this painting, it navigates to Etsy by typing in the URL. It clicks on the search bar and searches for colorful paints.
So now it's looking at different paint sets, and the agent gives us a sense of what it's optimizing for. As you can see, it's optimizing for price and visual appeal. The agent clicks on a set of watercolors that appears to be a good fit.
It's then added to cart. After a moment, the agent asks me if it should proceed to check out. This is an early research prototype, but it's a great example of how we are keeping a human in the loop and ensuring you're always in control.
I'll just respond by saying there's no need to check out, and now it completes its task and hands back control. Project Mariner is now available to a select group of trusted testers as an experimental Chrome extension. They're giving us feedback to help us build this safely and responsibly.
We are excited to see where this technology goes and what people create with it. Next, we have Gemini 2.0 Flash, which builds on the success of 1.5 Flash, the most popular model for developers. Now, this is absolutely incredible because this model is fast.
Notably, they state that 2.0 Flash even outperforms 1.5 Pro on key benchmarks, and at twice the speed.
2.0 Flash comes with new capabilities in addition to supporting multimodal inputs like images, video, and audio. 2.0 Flash now supports multimodal output, like natively generated images mixed with text and steerable text-to-speech multilingual audio. It can also natively call tools like Google Search and code execution, as well as third-party user-defined functions. Google's goal is to get these models into people's hands safely and quickly, and over the last few months, they've been sharing early experimental versions of Gemini 2.0, getting incredible feedback from developers. Gemini 2.0 Flash is now available as an experimental model, and I'm going to show you guys exactly what this model is capable of.
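Before the demos, if you'd rather call the experimental model from code than from AI Studio, here's a minimal sketch using the google-genai Python SDK. The model id "gemini-2.0-flash-exp" and the placeholder API key reflect the experimental release as I understand it, so treat the identifiers as assumptions and check the current docs.

```python
# Minimal sketch: calling the experimental Gemini 2.0 Flash model from Python.
# Assumes the google-genai SDK (pip install google-genai) and an API key from AI Studio;
# the model id "gemini-2.0-flash-exp" is the experimental identifier at launch and may change.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # key created in AI Studio

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="In two sentences, explain what makes an AI model 'agentic'.",
)
print(response.text)
```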
One of the first things I want to show you guys is Gemini's ability to do live streaming. Take a look at this demo. Okay, this is a quick demo of multimodal live streaming in Gemini 2.0, and we are casting a tab into AI Studio. Um, can you see my screen? Yes, I see your screen.
It's showing a document with the title "Demo." It also contains bullet points and a Google Meet window with a person in the video. Is there anything in particular you would like to talk about on the screen?
Yeah, can you start the demo by just reading the text I highlighted? Sure, I can do that. The text you highlighted reads: "The Multimodal Live API lets you build real-time multimodal applications powered by Gemini 2.0 Flash. It features tool use like Google Search, code execution, function calling, and more." Would you like me to read the following lines too?
No, can you actually just tell me what this word means, though? I can do that. Multimodal refers to the ability to process and understand different types of data, like text, images, and audio all together.
Does that help? Yeah, um, I need to demo the interruptions feature next. Can you tell me a really boring kind of "Once Upon a Time" sort of story?
Okay, I can do that. Once upon a time, in a small village nestled beside a quiet forest, there was a very unremarkable rock. This rock was gray, and it was round, and it sat there.
I need to interrupt you. No problem, I understand that interruptions are part of the demo. What should I do now?
Um, so we're going to demo memory. Can you just summarize everything we've seen or heard so far? I can do that.
So far, we've discussed a demo of the multimodal live API, which lets you build real-time multimodal apps. I defined what multimodal means. I was also asked to tell a very boring "Once Upon a Time" story, but I was interrupted.
Is there anything else? Uh, no, I think we're done. Um, do you want to read this ending card, or should I?
I can read it if you would like. The ending card says: "Start building with Gemini 2.0 at aistudio.google.com." Is that correct?
Yeah, that was great. Alright, thanks. Talk to you later.
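For developers, a stripped-down, text-only version of that live session looks roughly like the sketch below. It assumes the google-genai Python SDK's Live API as documented around the Gemini 2.0 launch; the method names (live.connect, session.send) and the response_modalities config are taken from that early documentation and may have changed since, so treat this as a sketch rather than a definitive implementation.

```python
# Sketch of a Multimodal Live API session (text in, text out).
# Assumes the google-genai SDK's early Live API surface; method names may differ
# in current releases, so verify against the official docs before relying on this.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
MODEL = "gemini-2.0-flash-exp"            # experimental model id at launch (assumption)
CONFIG = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send(input="What does 'multimodal' mean?", end_of_turn=True)
        # Stream the model's reply as it arrives.
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```

In the actual demo, the inputs were live audio and a screen capture rather than text, but the connect-send-receive loop is the same idea.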
Next, let's take a look at how you can actually use Gemini to natively output images. It's incredible what you're able to do with Gemini's new model, so take a look. These are examples of native image output in Gemini 2.0. Gemini 2.0 can now natively generate images as part of your conversation.
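From code, the rough pattern is to add an image response modality to the generation config. Here's a hedged sketch with the google-genai Python SDK; note that image output was limited to early testers at launch, and the response_modalities values, model id, and file names below are assumptions for illustration, not guaranteed API details.

```python
# Sketch: asking Gemini 2.0 Flash for interleaved text + image output.
# Image output was gated to early testers at launch; the response_modalities values
# and model id below are assumptions based on that access and may differ today.
from io import BytesIO
from PIL import Image
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
car = Image.open("car.jpg")  # hypothetical local photo of the car

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[car, "Turn this car into a convertible. Explain as you go."],
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# A single response can interleave text parts and image parts.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.text:
        print(part.text)
    elif part.inline_data:
        Image.open(BytesIO(part.inline_data.data)).save(f"output_{i}.png")
```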
Let's say you want to imagine how it might look to turn this car into a convertible. With today's AI tools, this takes a lot of work; you might have to manually mask out this roof area or write a really complex prompt. We brought this image into AI Studio and sent it to the new experimental Gemini 2.0 Flash model. This is a real-time screen capture, and all the prompts you'll see in this video are things that early testers can now try. This is the entire prompt we gave it: "Turn this car into a convertible," and it generated this image.
Here they are side by side. Notice how it successfully modified the car but also managed to keep the rest of the image consistent. This is difficult in many AI tools that use separate models, but here it's all being done by one model.
We continued the conversation, saying, "Imagine the car full of beach stuff, and change the color to something that feels like summer. Explain as you go." The model began outputting text.
It explained its idea for a new color, then showed it to us. But the really neat thing is that the model kept going; it went right on generating another image with the car full of beach gear. Remember that this is all a single response of text and image tokens coming from the model.
This ability to output across modalities, interleaving text and image together, is one of the most exciting aspects of Gemini 2.0. Here are some more example prompts and outputs, all coming from Gemini 2.0 Flash. Let's say you want to make this photo look a little bit nicer by getting rid of all this stuff on the couch. Just give it this prompt, and the model does that for you.
You can even combine images; you can ask what your cat might look like on this pillow or on this skateboard. All of this enables you to co-create with AI in new ways. We gave this image with this text on the side of the box to the model, and we said, "Open the box. Generate an image looking down into the contents of the box." We sent that prompt to Gemini 2.0 Flash.
This approach, where you send parts of your prompt in the image itself, opens up new possibilities for how we could have more seamless multimodal back-and-forth with the model. The model reasoned about the text on the side and generated this image of old electronics. You can push this even further, co-creating imaginary worlds together.
We went back to this first conversation in the video. We wanted to try communicating visually with the model, so we drew a circle on the door handle with the prompt, "Open this." The model successfully figured out what we meant, and it generated this image with the car door open.
For the last prompt, we said, "Make the car fly. Imagine you are the car soaring through the clouds to the beach. Show what that might look like." This is a challenge to see if the model can keep the car consistent while also visualizing this new scene that I'm imagining. The model was able to do it, following my instructions and generating this image and text. Now, let's take a look at how you're able to use Google Gemini's model to get spatial understanding.
It's pretty crazy what this model is able to do, so I'm going to let this demo show you exactly what's going on. These are examples of what you can build with spatial understanding in Gemini 2.0. We introduced this capability in our 1.5 models, and we've advanced it even further with Gemini 2.0.
This is a new tool in AI Studio that makes it easier to explore spatial understanding with multimodal reasoning. For example, you can input this image and prompt it to give you the positions of the origami animals. This is a real-time recording, and notice how fast the results came back.
That's because this is running on our new experimental Gemini 2.0 Flash model, which enables advanced spatial understanding with low latency. You can see if the model can reason about which shadow belongs to which animal by asking for the fox's shadow, and the model finds it over here.
Or ask it about the armadillo's shadow; it finds that too. Spatial understanding enables new ways to interact with images. It lets models do more than generate text about an image.
Gemini 2.0 can search within an image. You can give it this image and see if it can find both rainbow socks; the model finds the matching pair.
You can even ask it to find the socks with the face, and it finds those too. This was a particularly neat result because you see the faces are really small and obscured. Like all models, it won't always get everything right, so you can try your own prompts to see what works for you.
You can combine spatial reasoning with multilingual capabilities. You can give it an image like this and prompt it to label each item with Japanese characters and English translations. The model reads the text from the image itself and translates it.
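If you want to reproduce this kind of spatial prompting outside AI Studio, the usual pattern is to ask for bounding boxes in a structured format and parse them yourself. The sketch below assumes the google-genai Python SDK and the [ymin, xmin, ymax, xmax] coordinates normalized to 0-1000 used in Gemini's spatial-understanding examples; the prompt wording and file name are just placeholders.

```python
# Sketch: prompting Gemini 2.0 Flash for 2D bounding boxes and parsing the result.
# Assumes coordinates come back as [ymin, xmin, ymax, xmax] normalized to 0-1000,
# the convention used in Gemini's spatial-understanding examples.
import json
from PIL import Image
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
image = Image.open("origami.jpg")  # hypothetical photo of the origami animals

prompt = (
    "Detect each origami animal. Return only a JSON list where each entry has "
    "'label' and 'box_2d' as [ymin, xmin, ymax, xmax] normalized to 0-1000."
)
response = client.models.generate_content(
    model="gemini-2.0-flash-exp", contents=[image, prompt]
)

text = response.text.strip()
if text.startswith("```"):                # the model often wraps JSON in a code fence
    text = text.split("```")[1].removeprefix("json")
for item in json.loads(text):
    print(item["label"], item["box_2d"])
```

The AI Studio spatial tool does this parsing and box-drawing for you; this is just what the raw request looks like.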
With spatial understanding, Gemini 2.0 enables AI agents that can reason about the physical world. For example, you can give the model this photo and ask for the position of the spill, then ask how it would clean it up with an explanation.
The model points out the towel over here. With Gemini 2.0, we're introducing 3D spatial understanding.
This is a preliminary capability still in early stages, so it won't be as accurate as 2D positions, but we're sharing it so that developers can try it and give us feedback. Here's a Colab notebook that lets you prompt the model to give you 3D positions within photos; then we visualize those positions in a top-down view, essentially turning the photo into an interactive floor plan. Now, the next one is genuinely going to surprise you.
Google has done an absolutely amazing job enabling the Gemini Flash model to natively understand and produce lifelike audio. It's pretty incredible, so take a look at this demo. Okay, let's talk about native audio.
Gemini 2.0 introduces multilingual native audio output, but maybe you're thinking, "What exactly is native audio?" You're actually hearing it right now.
Everything you hear in this video was generated with prompts like the one currently on your screen. It was all generated by prompting the new experimental Gemini 2.0 Flash model, as you see in this AI Studio screen capture.
Neat, right? Totally! Native audio is really, really neat.
It's different from traditional TTS (text-to-speech) systems. What's super cool with native audio is you can do more than just prompt an AI agent on what to say; you can tell it how to say it. You can prompt it to just be like, "Dude, you know, just totally chill," or prompt it to speak with "oh so very many dramatic pauses."
And all of this is multilingual! You know how, when a computer switches languages, it sounds like a different voice? That's a limitation of traditional TTS, but with native audio in Gemini 2.0, you can build agents that switch languages more seamlessly. Check this out. Okay, so I'm starting out speaking English, but then... and now I'm back to English.
Cool, right? With native audio, maybe information retrieval could be more expressive. Like, what if AI agents could tell you the weather differently on sunny days?
Maybe they'd sound like this: "Ooh, the weather today is 74° and sunny all day, awesome!" But if it's rainy, it might sound more like this: "So, the weather today is pretty meh, drizzly and cold all day. Oh well." Or what if an AI agent responded dynamically to your context? Maybe if it seems like you're in a rush, your agent can speak really, really, really, really quickly. Or maybe if you're whispering, you might want your agent to whisper back.
You see what I mean? So many possibilities with native audio. New output modalities are available to early testers now, with a wider rollout expected next year.
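Since these voice demos came straight from prompting the model, a developer version looks like the earlier live-streaming sketch, just with an audio response modality. This is again a rough sketch against the early google-genai Live API; the 24 kHz, 16-bit PCM output format, the config keys, and the example prompt are assumptions drawn from that early documentation, so verify them before use.

```python
# Sketch: requesting native audio output over the Live API and saving it to a WAV file.
# Assumes the early google-genai Live API surface and 24 kHz, 16-bit PCM audio output;
# both are assumptions to check against current docs.
import asyncio
import wave
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
CONFIG = {"response_modalities": ["AUDIO"]}

async def main():
    async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=CONFIG) as session:
        await session.send(
            input="Tell me today's weather report, but say it like you're in a big rush.",
            end_of_turn=True,
        )
        with wave.open("reply.wav", "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)      # 16-bit samples
            wav.setframerate(24000)  # assumed output sample rate
            async for message in session.receive():
                if message.data:     # raw PCM audio chunks
                    wav.writeframes(message.data)

asyncio.run(main())
```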
Start building with Gemini 2.0 at aistudio.google.com. Okay, okay, wait! I know the video is supposed to be over, but let me just show you a few more prompts we could do.
That last line? More like, "Start building with Gemini 2.0." Okay, start building with Gemini 2.0. Um, I guess, start building with Gemini 2.0. Yay!
Let's actually take a look at how you can start building with Gemini's tool use in AI Studio. These are examples of what you can build with native tool use in Gemini 2.0. Gemini 2.0 is natively built to use tools like code execution and Google Search. Here's a demo that combines tool use with real-time interaction, built using the new Multimodal Live API: make a bar graph that compares the runtimes of The Godfather and Oppenheimer. Add the other two Godfathers to the graph, pick three random superhero movies, and add them to the graph.
Notice how quickly the model responded—that's because this is powered by our new experimental Gemini 2.0 Flash model. It's able to search and code while you interact in real time.
One really neat thing about this demo is how easy it is to set up the graph renderer. All you have to do is describe what that tool does, like this, and the model, using the simple description you give it, figures out how to render graphs all by itself. We've open-sourced this demo along with Colab notebooks to get you started.
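That "describe what the tool does" step maps onto function calling in the API: you hand the model a function with a docstring, and it decides when to call it. Here's a hedged sketch using the google-genai Python SDK's support for passing Python functions as tools; render_bar_graph is purely hypothetical, standing in for the demo's on-screen graph renderer, not an official API.

```python
# Sketch: exposing a hypothetical bar-graph renderer to Gemini 2.0 as a tool.
# The model reads the function signature and docstring, then decides when to call it.
# render_bar_graph is a stand-in for the demo's real renderer, not an official API.
from google import genai
from google.genai import types

def render_bar_graph(title: str, labels: list[str], values: list[float]) -> str:
    """Draw a bar graph. labels are bar names; values are bar heights, e.g. runtimes in minutes."""
    for label, value in zip(labels, values):
        print(f"{label:<25} {'#' * int(value // 5)} ({value})")
    return f"rendered graph: {title}"

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Make a bar graph comparing the runtimes of The Godfather and Oppenheimer.",
    # Passing a Python callable lets the SDK handle the function-calling round trip.
    config=types.GenerateContentConfig(tools=[render_bar_graph]),
)
print(response.text)
```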
You can also explore tool use in AI Studio. Let's say you have to research New York restaurants and you need to look up the phone number, address, and write a brief description for all of these places. We gave this prompt to the 2.0 Flash model, and look how quickly it does that for you. The model is writing search queries, retrieving info, and putting it into this table. It even provides citations and links so you can explore further.
You can customize and control how tools are used in your applications. You might want to use Google Search for some things and not others. You can write instructions like these, telling it to only use Google Search for questions about sports.
Then, if you ask something like this, the model will answer without using search, but if you ask this, it'll use Google Search to answer, just like you instructed it to. This ability, where the model decides when to use the right tool based on your instructions, is one of the strengths of native tool use in Gemini 2.0.
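Here's roughly what that instruction-gated search setup could look like in code. It assumes the google-genai Python SDK's Google Search tool for Gemini 2.0; the system instruction wording and example questions are just placeholders.

```python
# Sketch: native Google Search as a tool, gated by a system instruction.
# Assumes the google-genai SDK's GoogleSearch tool type for Gemini 2.0.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
config = types.GenerateContentConfig(
    system_instruction=(
        "Only use Google Search for questions about sports. "
        "Answer everything else from your own knowledge, without searching."
    ),
    tools=[types.Tool(google_search=types.GoogleSearch())],
)

# Should answer directly, without a search.
print(client.models.generate_content(
    model="gemini-2.0-flash-exp", contents="Write a haiku about rain.", config=config).text)

# Should trigger a Google Search, per the instruction.
print(client.models.generate_content(
    model="gemini-2.0-flash-exp", contents="Who won yesterday's Premier League matches?", config=config).text)
```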
These are just a few examples of how you can build your own multimodal AI agents with Gemini 2.0. Next, what we have is something rather fascinating: meet Jules, your AI-powered code agent.
Imagine your team has just finished a bug bash, and now you're staring down a long list of bugs. Starting today, you can offload Python and JavaScript tasks to Jules, an experimental AI-powered code agent that uses Gemini 2.0, working asynchronously and integrated with your GitHub workflow.
Jules handles bugs and other time-consuming tasks while you focus on what you actually want to build. Jules creates comprehensive multi-step plans to address issues efficiently, modifies multiple files, and even prepares pull requests to land fixes directly back into GitHub. Next, what we have here is a full-length demo of an AI agent built with Gemini 2.0 to play the game Squad Busters. This is a real-time, unedited interaction between the gamer and an AI agent. The agent's responses are based on the video and audio feed from the user, along with information retrieved from the internet.
No specific post-training or game integration has taken place. "Hey, listen, I'm about to start a game, and I want to make sure I complete the weekly quests. Can you see them?" "Yes, I can see them. Looks like you need to pick up 300 gems and dust 10 boss monsters. You've already got a head start on the 300 gems."
"Cool, I'll ask you to remind me about it during the game 'cause I keep forgetting about it." "Absolutely! I'll keep an eye out and remind you about the quests." "Thanks." "You got it! Just let me know if you need a reminder." "Anytime! Um, hey, I'm about to pick a character. Can you do a quick search on Reddit? What is the meta, as in like, what are the best characters?" "Let me look that up for you. I'll let you know what I find.