This year is going to be the year of reasoning models, and we have yet another one. This is an update to Gemini 2.0 Flash Thinking that gives it a substantial boost on mathematics and science. The announcement was made by Demis Hassabis in a post on X. The math benchmark score goes from almost 65% all the way up to about 75%, with a gain on the science benchmark too. Apart from the performance boost, there are some other features as well, primarily a long context window: they extended
the context window from 32,000 tokens all the way up to 1 million tokens. It also natively supports code execution, which is something that only Google currently provides: the model can execute code behind the API and use the result in its reasoning. The outputs can be much longer in terms of the number of tokens it can generate, and there should be less frequent model contradictions. So we should expect better performance on math, science, and multimodal reasoning. They seem to be working on a newer version already, and
according to Logan, scaling is alive and well. There's a lot that can be done with test-time scaling, and that seems to be the general direction every company is taking with these new updates. This is also the top-ranking model on the Chatbot Arena leaderboard. Surprisingly, they didn't share any other benchmarks. Although this is a substantial improvement over the previous version, it still lags behind DeepSeek R1, which was released a couple of days ago, on mathematics.

Since it's a reasoning model, I'm going to test it on
misguided attention and some coding problems. The model is available both through the API and in Google AI Studio, so we're going to select it from the drop-down menu. It's described as best for multimodal understanding, reasoning, and coding, and the recommended use cases are: reason over the most complex problems, show the thinking process of the model, and tackle difficult code and math problems. It has a pretty recent training cutoff date as well, so it should be pretty up to date. If you look at the number of tokens,
it has about a 1 million token context window, and the maximum number of output tokens you can set is much larger now: it's set to about 65,000 tokens, which is pretty awesome compared to the other foundation models. You can also enable code execution, which should help with problems where writing code is useful. If you test something like "how many R's are in the word strawberry", it has probably seen this in the training data by now, so it says there are three R's.
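That count is easy to verify in code. Here is a minimal sketch in Python; `count_letter` is a made-up helper name for illustration and is not the code the model generated:

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    # Lowercase both sides so "R" and "r" are treated the same.
    return word.lower().count(letter.lower())


if __name__ == "__main__":
    print(count_letter("strawberry", "r"))
```

A check like this sidesteps the tokenization issues that make letter counting unreliable for language models.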
But let's see what happens if we enable code execution. I would suspect that it will probably write Python code, and yeah, it did write the Python code. So if you enable code execution for problems like these, you can see that it will use Python code to make sure the answer is correct, which makes sense. Code execution behind the API can be extremely helpful, because for problems that can be solved through code, these models can use code to come up with more correct
answers.

Okay, but how good is this when it comes to misguided attention? If you haven't seen some of my previous videos on misguided attention, I highly recommend them. Just to give you a quick overview, these are a set of prompts based on well-known paradoxes and problems, but with slight modifications. The idea is that if the LLM is actually smart, it should be able to pick up on those slight modifications and use logical deduction rather than its training data. So the prompts that I usually test don't really tell you the model's
capability on hard mathematics or science problems; they just show how good it is at picking up these nuanced changes in the prompt. The best model on these specific types of problems, based on my testing so far, is DeepSeek R1. I recently created a video on that, so I highly recommend checking it out.

Okay, so we're going to start with this one. It's a modified version of the trolley problem, but instead of five living people there are now five dead people on the track. Let's see what it comes up with. DeepSeek
was able to identify that the people are already dead and formulate an answer based on that observation. This model shows its internal thought process, very similar to the previous version. It starts off by saying: the user is asking a variation of the classic trolley problem; this is a thought experiment in ethics and morality; I need to explain the dilemma, present the two main ethical frameworks to apply, and then discuss common responses and criticisms. So the plan of action is: identify the core dilemma, introduce the different frameworks, then
discuss common responses, acknowledge the limitations of the trolley problem, and then provide a conclusion. Here is the output that the user is supposed to see. It says this is a classic trolley problem, a thought experiment in ethics and morality; there's no single right answer, as it's designed to make you think about different ethical principles. Then it talks about the dilemma and the two frameworks that can be applied. It seems like it completely missed the part that the people are already dead, so I would
say it's a fail. Okay, let's ask it what the status of the people on the track is, and see if it can actually figure out that the people are already dead. It says: here's the thinking process to arrive at the detailed answer about the status of the people. Identify the core question: the user is asking about the status of the people on the track; this is intentionally vague, and my first step is to recognize the vagueness and explore what "status" could mean in this context. So I think it's just overthinking,
but let me look at the final answer. It says: "Aha, that's a crucial clarification. You have pointed out a potential misunderstanding in your initial prompt. Let's address the correct status of the people in the classic trolley problem, because your wording, 'five dead people', is actually the opposite of the usual setup." It thinks that I made a mistake in the wording, and it's still trying to explain the original trolley problem. One thing it does say is that if the five people were already dead, there would be no ethical dilemma at all:
you would obviously pull the lever to save the living person.

Next is a modified version of the famous Monty Hall problem, and the idea here is that the prompt is changed in a way that the final probability is just 50%. Let's run this and see what it comes up with. I think it again went back and tried to solve the original Monty Hall problem. One way you could force the model into extra thinking is to add system instructions, but I want to see how smart the model
is without any specific directions: can it identify that there are actually variations in the prompt? That's why I'm not adding any additional instructions, and I haven't done that with other models either. Here it says: let's break down the classic game show dilemma, the Monty Hall problem; it's trickier than it seems at first glance. Understanding the setup: there are three doors; behind one door is a car, the prize you want, and behind the other two are goats, which are undesirable. It says this is a slightly unusual variation of the classic Monty
Hall problem, where Monty opens a door that you did not pick. So it's good that it actually realizes there is a slight variation. It continues: let's assume this is still meant to be the core puzzle, just worded slightly differently. If he opened your door and showed a goat, it implies the car must be behind door number two or three. Then you pick door number three, and Monty offers you a chance to stick with door number three or switch to door number two. So far it's following the prompt that we provided. Let's see what
happens next. It says: when you initially chose door number one, you had a 1/3 chance of picking the door with the car. This also means there is a 2/3 chance that the car is behind one of the other two doors. Monty's action is not random. In this case, the problem says he opened door number one and showed a goat; this is a bit unusual for the classic problem, but we will work with it. Okay, then it says: let's rethink the classic Monty Hall scenario slightly to better fit the prompt's wording and understand the logic
in the more typical Monty Hall problem: you pick door number one, but he opens a different door, let's say door number two. Now this is just going back to the original problem, just like with the trolley problem, and it ends up solving the original Monty Hall problem rather than this modified version. I had to say, "Assume that the wording of the given problem is correct. How would you solve this?" In this case it comes up with the correct answer, that there's a 50/50 chance, so it doesn't really
matter whether you switch doors or not.

Next, we're going to use the "normal barber" paradox. This is a modified version of Russell's paradox. In the original, the barber's rule is that he shaves only those who do not shave themselves, but here he shaves all the men in the town who visit him, and the question is: does the barber shave himself? Let's see. It says this is a classic riddle that plays on wording, and here's how to break it down. Understanding the rule: he shaves all the men in the town
who visit him. This means that if a man in town goes to the barber for a shave, the barber will shave him. The question is whether the same principle applies to the barber: is he going to shave himself? So far this is correct; I don't see it falling back to the original paradox. This is probably one of the best answers that I have seen. If you look at this, it says the riddle is designed to be a bit tricky, and the most common and intended answer is that no, the barber does not
shave himself. Then it says the rule is about customers: the rule describes who the barber shaves as customers; it's about his service to others. Visiting him implies going to the barber as a client to receive a service, a shave. The barber is not visiting himself in that customer sense; he is in his own barber shop. The example that it presented is this: imagine the rule was that he cooks dinner for all the people in the town who come to his
restaurant. Does the chef cook dinner for himself? Yes, likely, but not because he is visiting his own restaurant as a customer. He cooks for himself as part of his job and personal needs, not because he's fulfilling the condition of visiting his restaurant. So it's paying attention to this part, the people who are his customers, and that kind of makes sense as well. But in general he is able to shave himself, because there is no other rule that stops him from doing it. Okay, so we are
actually getting somewhere with the model; at least on that last one I think it had pretty good reasoning.

Here's another variation, a modified version of the Schrödinger's cat paradox, but in this case the cat is already dead, and it seems like it picked up on that. It says this is a bit of a trick question playing on the famous Schrödinger's cat paradox. Let's break it down. Starting condition: the cat is dead when it's placed in the box. The key difference in this scenario: the cat is already dead, and nothing that
happens inside the box can change that fact, whether the isotope decays, the detector triggers, or the poison is released. So it's really good that it's able to figure out that the cat is already dead. The temperature is pretty high, at 0.7. I didn't want to keep it at zero, because in that case it's most probably going to just stick to the original version of these problems or paradoxes, so having a higher temperature can definitely help.

We're going to try a couple more before we look at a couple
of coding examples. In this case we have a farmer who has a wolf, a goat, and a cabbage, and we only want him to transfer the goat; we don't need to worry about the wolf and the cabbage. It again came up with a pretty complicated sequence of steps, which is not needed. It should be able to just say, okay, take the goat across and stop there. But like the other models that I have tested, including o1, it came up with an unnecessarily complicated framework. Now what about
this one? This should be relatively straightforward. We have a 6-liter and a 12-liter jug, and we want to measure exactly 6 liters. Most of the other LLMs come up with a complicated sequence, which is not needed whatsoever. Let's see. It says this is a classic water jug puzzle; here's how you measure exactly 6 liters using a 6-liter and a 12-liter jug. Fill the 12-liter jug. Pour water from the 12-liter jug into the 6-liter jug until the 6-liter jug is full. This will use 6 liters from
the 12-liter jug, and you have exactly 6 liters remaining. This is one way of doing it, or you can just fill the 6-liter jug directly. I think R1 was really the first model that gave me two solutions: one was basically this, and the other was to directly fill the 6-liter jug.

Next we're going to look at a couple of coding examples. I think I'll have to update these, because some of them are pretty easy by now. The first is: create a website that has a click
me button, and it's supposed to randomly change the background color as well as show a random animation and a joke. It's thinking for quite a while; I think it thought for about 6 seconds, and here's the code that it came up with. Once the thinking process is complete, the rest of the generation is extremely fast. Okay, so here's the code, and if I click on the button, it does seem to be working. Pretty nice.

The second coding problem is: visually explain the Pythagorean theorem using Manim. That's the
code I want it to generate. I'll actually remove this part, because I already know how to run this on macOS. When I used the same prompt with DeepSeek R1, it was able to walk me through a step-by-step process and figure out some of the installation issues. Let's see what code it's going to generate. The internal thought process for this specific prompt is: understand the goal, which is exactly what we want; then core visual ideas, so it came up with how it's going to approach it and what
are the different primitive concepts and elements it wants to use. It also came up with a breakdown of the scene, so how the animation is going to look; then code structure and functions, and refinement and details. So it's pretty detailed code, but let's see if it's actually going to work. It also gave me a very detailed explanation of what the different parts of the code do. So here's the code. I'm going to run this, and you can see here that I've been testing different models on the same prompt. Let's see
if we're going to run into any issues. It seems to be able to generate the animations. Now we've run into some issues, so let me see if it's going to be able to help me with this. I changed the color to green; other than that, it was able to execute it. Let's see what the final output looks like. Okay, so here's the final output. The animations look pretty nice. Oh, it did add some pretty interesting animations, and I think it adds some text as well, which basically explains
the main concept, although I would probably have wanted it to use a different color scheme. But this seems to be working.

My final coding task is building a web app, which is the type of feature I would usually use an LLM for. In this case I asked the model to create the project structure and give me a bash command that creates the whole project structure with a single command. It seems to be doing that, and then it's writing the different code files. In general, so
far this seems to be really good at coding, and it was reasonably good on the misguided attention problems as well. Okay, so I'm going to just copy that command, and it's going to create... let me just refresh. Yeah, it created this folder for us, which has all the files that I need. I'm going to quickly copy everything here, and then let's run this. Okay, so I had quite a few back-and-forths with the model. It gave me some updated code snippets, then I said
"I'm too lazy, give me the complete code," and after a couple of back-and-forths trying to fix some issues, it was able to give me the complete code. Here's how the app looks. I asked it to create an image of a tiger; in the background it's using the Flux model hosted on Replicate. If you ask it to regenerate, the functionality seems to be working, and if you ask to download the image, the download functionality also works. So it does work, but I think it
took a little more time than I was hoping for.

Overall, a pretty impressive update, especially since this is one of the few models that was able to solve some of the misguided attention problems. The coding problems are relatively simple, but something like this could be a useful feature implementation. The fact that it now supports 1 million tokens is a huge boost, and so is the fact that it can generate up to 65,000 tokens, which is pretty crazy. Above all, you can use this for absolutely free, both the API
and the web interface. It's crazy to think that we have such capable and smart models for absolutely free. Now, you do pay with your data: if you are actually building a real product or sharing your personal files or proprietary data, then I definitely recommend either going with a paid API, which assures you that there are no data privacy issues, or just running smart enough models locally, because now we have quite a few options. Anyways, this was a quick look at Gemini 2.0 Flash Thinking with the latest update, which is currently the best model
on the Chatbot Arena leaderboard. There are going to be a lot of reasoning models in 2025, so apart from agents, I think this is going to be one of the hot areas. Anyways, I hope you found this video useful. Thanks for watching, and as always, see you in the next one.