Back in December, Google announced Gemini 2.0, but so far we've only had access to the Flash members of the family. Today they announced Gemini 2.0 Pro Experimental. This is their most powerful model, and I was fortunate enough to get early access; in this video I'm going to share some of my very early tests and thoughts about it. The model is multimodal from the ground up, so it has native image generation and image understanding capabilities, and it can also produce speech output and natively use tools. However, with this initial release, they're only enabling text outputs. And even though it's not specifically a reasoning model, based on my initial tests it's really good at reasoning tasks, and probably one of the best models I have seen, although it does need a little bit of help. At the time of recording I did not have access to the benchmarks, but I assume those are going to be state-of-the-art. I want to show you some of the capabilities of the model and my initial tests, so let's get started.

At the time of recording, the model is listed as Gemini 2.0 Pro Experimental 02-05; the latter part refers to the expected release date, and as the name suggests, it's still an experimental model. The Flash already has a stable version, called Gemini 2.0 Flash, that is available within the Gemini app, so hopefully we're going to get a stable version of the Pro as well. It has up to a 2-million-token context window, which is pretty substantial and one of the longest in the industry right now. It also supports structured outputs, which you can enable or disable, and code execution: basically, the API can generate and execute Python code behind the scenes and use those results when generating a response. This is a feature that is exclusive to the Gemini API at the moment; I haven't seen it from any of the other providers, and I think it's one of the best features, especially if you want grounded answers. The output is currently limited to 8,000 tokens. It also has native function calling, which means it could be really good for agentic use. I had earlier access to the same model under a different name; at that time it didn't have grounding with Google Search enabled, but it seems like now you'll be able to ground your answers with Google Search. What that means is that you can enable Google Search, and the model will use search results to augment its answers. Here's an example where Google Search can actually help: I asked it to look up the VRAM of the latest RTX GPUs. In this case, it used Google Search to figure out that the latest RTX GPUs are the 50 series, and then it compared them with the previous generation, the 40 series. You can look at the search terms it used: "latest RTX GPUs VRAM" and then "RTX 40 series VRAM versus 30 series". I think the second term is coming from the training data, whereas the first gives us the most up-to-date information, and these are the two sources it used to generate the final answer. It's not a reasoning model, so it will not be able to reason
like the dedicated reasoning models. Next, I'm going to show you some quick tests, and one surprising thing I noticed was that even though it's not a reasoning model, adding a system prompt can really increase its reasoning capabilities; I'll show you a few examples later in the video. This is also the first model that was able to solve almost all of the misguided attention prompts, with a little bit of help. But first, let's look at some coding prompts.

The first one I tried was to write a Python script for a bouncing ball: the ball is supposed to be red and inside a triangle, it needs to handle collision detection properly, and it needs to make sure the ball stays within the triangle. This is currently a viral prompt on X and on Reddit. Now, the generation speed of Gemini 2.0 Pro Experimental is pretty great: it generated about 3,200 tokens in about 24 seconds. The results are interesting. The ball stays within the triangle at first, but after some time it just goes outside the triangle, and I have seen this be a problem with a lot of models on this specific test.

For my next prompt, I asked it to use JavaScript to create an animation of falling letters with realistic physics. This is a much more complex scenario than the previous one: the letters should randomly appear at the top, they need to fall under Earth's gravity, we need collision detection, there should be interaction with the ground, the screen boundaries, and the other letters, it needs to dynamically adapt to screen size changes, and the background is supposed to be black. Here is the code it generated, and you can see the results are pretty good: letters fall in a random order, and they seem to interact with both the ground and the other letters. If I resize my screen, it dynamically adjusts to that as well, so it seems like all the requirements are fulfilled, which is pretty great: you have random letters falling, interacting with the ground, with other letters, and with the screen boundaries. I used the same prompt with o3-mini set to high, and it wasn't able to do it; it pulled in a whole bunch of external packages. But in this case, the Gemini 2.
0 Pro Experimental 02-05 is able to do this in a single shot, although I think Google really needs to work on these naming conventions. My other programming test was to ask it to create an HTML page: it needs to have a button that puts a random joke on the page, randomly changes the background color, and also adds a random animation. This is a relatively simple task by this point for any LLM, and it was able to do it without any issues. Here's how the output looks for that task: if I click on the run button, it shows me a random joke. I would love for this model to be included in something like Deep Research, especially if it gets a thinking or reasoning version, because as you will see in the next section, with a little bit of prompting it can do really good reasoning.

Now, if you ask any LLM to count the number of R's in "strawberry", it will most probably tell you that there are three R's. However, I added an additional R, so the word is misspelled, but the model still went back to the original spelling, so in reality its answer was incorrect. However, I told it to use code if needed, and I enabled code execution, because the Gemini models can use a Python interpreter behind the API. In this case, it decided to write Python code and run it, and the code returned a value of four. So now it's using the correct spelling of my prompt, and it says: "My apologies, the code tool has correctly computed that there are four R's; I had previously miscounted." Code execution is a very powerful feature that is currently only available in Gemini; I wish the other model providers would implement something like this as well. And you can use it not only within AI Studio but also through the API.

Okay, next I want to test the reasoning capabilities of Gemini 2.0 Pro. We're going to be using the misguided attention repo; if you haven't seen some of my previous videos, these are commonly known thought experiments or trick questions, with the only difference being slight variations in the wording that really change the meaning. If the model is smart enough, it will use logical deduction to come up with the answer; however, many of the LLMs that I have tested simply fall back to the unmodified original problem, because they have seen it so many times in the training data.

The first one is a modified version of the trolley problem, in which the five people on the track are already dead. I kept the temperature at one so that the model has creative freedom; let's see what it comes up with. It says this is the classic formulation of the trolley problem, a thought experiment in ethics designed to expose the conflict between different moral principles, and that there is no single right or best answer. Just like other LLMs, it goes on to present points of view from different ethical frameworks, without realizing that the people are already dead. However, you can fix this with a very simple system prompt: I asked it to read the user question and print it as-is before generating the answer, and to pay close attention because things might not be what they look like. With the same prompt plus this small additional system message, it again walks through the same different principles; however, at the end it says: the question states that the five people are dead and the one person is alive, therefore pulling the lever kills the one person, and not pulling the lever does not save anyone, therefore you should not pull the lever. So, in conclusion: because the five people are already dead, there is no benefit to pulling the lever, and doing so would result in the death of one person, therefore the lever should not be pulled. You can see that even such a big model can use a little bit of direction and prompt engineering
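That nudge is just a small system instruction. As a concrete illustration, here is a minimal sketch of what a request body for the Gemini API's generateContent endpoint could look like with that system message attached. The model identifier and the exact wording of the nudge are my assumptions based on what's shown in the video, so treat this as a sketch, not the exact setup used.

```python
import json

# Hypothetical model name, assumed from the "02-05" experimental label
# mentioned in the video; check AI Studio for the current identifier.
MODEL = "gemini-2.0-pro-exp-02-05"

# The "small additional system message" described above (paraphrased).
SYSTEM_NUDGE = (
    "Read the user question and print it as-is before generating your "
    "answer. Pay close attention: things might not be what they look like."
)

def build_request(user_prompt: str, temperature: float = 1.0) -> dict:
    """Build a generateContent-style request body with the system nudge."""
    return {
        "system_instruction": {"parts": [{"text": SYSTEM_NUDGE}]},
        "contents": [{"role": "user", "parts": [{"text": user_prompt}]}],
        "generationConfig": {"temperature": temperature},
    }

body = build_request(
    "A runaway trolley is heading toward five people who are already dead..."
)
print(json.dumps(body, indent=2))
```

With the REST API you would POST a body like this to the model's generateContent endpoint; the official SDKs expose the same ideas through system-instruction and generation-config parameters.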
Now, I see a similar effect of prompt engineering on some of the other paradoxes. For example, here is a modified version of the Monty Hall problem. The main difference is that in this case, when you choose a door, say door number one, Monty actually opens that same door; you then pick another door, and Monty gives you the option of switching. In the original setup, switching gives you a 2/3 chance of winning the car, and Gemini 2.0 Pro just runs through the same logic as the original problem and gives you that same probability. For the original Monty Hall problem, switching really is better; for this modified setup, however, it makes no difference. And you can see that with the simple system prompt, the thought process is completely different and it reaches the correct conclusion: it says that because of this change in the setup, the question as presented is now a 50/50 chance, and it doesn't matter whether you switch to door number two or stick with door number three. So it seems like prompt engineering, or a system prompt, really does have an impact with Gemini 2.0 Pro.

I'll quickly look at some of the other examples. Here's a modified version of the barber paradox, a popular illustration of Russell's paradox; the main difference is that the barber's rule is that he shaves all the men in town who visit him. Without the system prompt, the model falls back to the unmodified version and picks the original rule, that he shaves all the men in town who do not shave themselves, and answers based on that rule. With the system prompt, after printing the original question, it says: this is a slightly modified, but flawed, version of the classic barber paradox, which is itself a popularized version of Russell's paradox from set theory; the original paradox presents a contradiction, while this one does not; let me examine the prompt. It correctly identifies that the barber shaves all the men who visit him, and that this phrasing presents no contradiction, since it is the barber's own choice whether to visit himself: if the barber shaves himself, he is a man who visits himself and is within the rule. Then it discusses what would happen if we were to stick to the classic paradox.

For the dead cat in a box, which is a modified version of the Schrödinger's cat thought experiment, Gemini 2.
0 Pro already figured out, even without the system prompt, that the cat is already dead, and that therefore the probability of it being alive when you open the box the next day is zero. With the system prompt, it again identifies that the cat is already dead, and it says: since the cat is already dead when it's placed in the box, the state of the radioactive isotope, the poison, and the detector are completely irrelevant; the cat's state is already determined. Then it talks about the original Schrödinger's cat thought experiment. So you can see that with a small nudge, it pays much more attention to the wording, and in a single shot it figures out the variations that were introduced and how they affect the final answer.

There's another one where you have a 6-liter jug and a 12-liter jug and you want to measure exactly 6 liters. With the additional system prompt, it basically gives the answer we're looking for; with the unmodified version, however, it says to fill the 12-liter jug completely and then pour it into the 6-liter jug until it's full. With the system prompt, it says: that's it, you have exactly 6 liters of water in the jug when you use the 6-liter jug; the 12-liter jug is completely irrelevant.

For the farmer problem, where the farmer has a wolf, a goat, and a cabbage, but we are only interested in getting the goat to the other side, Gemini 2.0 Pro again comes up with a complex sequence of steps: take the goat across, return alone, then choose whether to take the wolf or the cabbage next. But the problem is, we are not interested in the remaining steps at all; we just want to transport the goat. With the additional system prompt, however, it says: this is the classic wolf, goat, and cabbage river-crossing puzzle; the question as posed, though, is trivially easy, focused on just the goat, while the full problem involves getting all three across; let's answer the specific question and then explain the full solution for context. The answer to the specific question is that the farmer simply needs to take the goat across the river on the first trip; since the farmer is only asked to transport the goat safely, and nothing else, this fulfills the requirement. Extremely smart. I haven't seen any other model actually give me this answer. It then goes on to give the full solution in case you're trying to solve the original problem.

So this is probably one of the most impressive models I have seen on the misguided attention prompts, and for me it has essentially solved them, even though we had to provide it with a little bit of instruction. I can imagine that if Google decided to create a thinking version of Gemini 2.0 Pro, it would probably solve these misguided attention problems out of the box. But I'm going to say it: I am impressed with its reasoning, given a little bit of help. Okay, I haven't really seen the benchmarks yet, but as I always recommend, make sure to test this on your own set of problems before you decide whether it's a good model or not; your tests might give you very different results than what I'm seeing here. But based on my quick experimentation, I still think that the Flash 2.