Phi 4 on Ollama - is it REALLY better than Claude 3.5?

15.71k views, 5,896 words
Chris Hay
Phi-4 is a new 14-billion-parameter model from Microsoft that is claimed to beat GPT-4o, Llama 3.3 a...
Video Transcript:
hey, welcome back. So Microsoft have just released the brand new Phi-4, a 14-billion-parameter model, which is insanely good for the size of model that it is. Now, Microsoft is claiming that it beats GPT-4o and Claude 3.5 Sonnet in the area of math, which we all know is complete nonsense. However, that doesn't make it not a good model, because actually the one thing this model does really well is chain-of-thought reasoning. Now, it's a little bit frustrating, but they have set themselves up really well for the future, and I think they may be one or two models short of having something that really performs in the area of inference-time compute. In this video I'm going to show you how to get started with the Phi-4 model using Ollama on your local machine, and I'll show you some of the pros and cons of the model. As you'll see, it's a really great model; I just think it's maybe one version too early. But let's get started, and I'll show you how to get it on your machine so you can start playing with it yourself. Now, obviously, to get started you're going to have to have Ollama already installed on your machine (I cover that in another video), and then really what you're going to do is go to ollama.com
and click on Models. If you sort by newest, you'll see that at the time of recording Phi-4 isn't officially in the library, but if I do a search for "phi-4", you'll see there is a model there called vanilj/Phi-4. When Microsoft release it officially it will just be called phi4, and you'll be able to run it by just doing ollama run phi4, because that's how it's worked for every other Phi model in the series. For today, though, we're going to run vanilj/Phi-4, and I'm going to use the q8_0 version of the model, because I've got a lot of memory and I want to get the best results possible; but you can just run the latest tag on your machine, and you should be able to run it if you've got something like 16 GB of memory. So, as you can see, I've now got Phi-4 downloaded, and if I want to chat in the terminal I can just type something in, and you see it comes back, and it's pretty good in that sense. Of course, I don't want to be interacting with it in the terminal like this, so I'm going to use Open WebUI, and if you want to get started with Open WebUI you can just go to openwebui.com
and click on Docs, and there's a pretty good getting-started guide. In fact, if you go to the quick start and do the docker pull followed by the docker run, you'll have it running locally with Docker on your machine, and that's what we're going to do today. So I'm just going to do a docker run, which gives me Open WebUI running on my machine, and then within my browser I can go to localhost:3000 and chat with the model directly. So you see I've got a new chat, and I'm going to select vanilj/Phi-4 there, which gives me the Phi-4 model (it will just be phi4 when it's officially out on Ollama), and then I can chat with the model as before. You see it's coming back with answers, and then I can say "what is your name". So it's pretty easy to get started with the Phi-4 model; it's fairly quick, as you can see, and it comes back with nice responses. Now, they reckon this model is super good at math, so I'll give it a typical math-style problem: a grocery store sells apples for $1 each and oranges for $1.50 each; if you have $10 and buy twice as many apples as oranges, how many of each can you buy? If I ask it that question, it works through a little bit of chain of thought, and I think this is actually the key to why this model does so well in the math domain: it's not just coming back with an answer straight away, it's actually working through its chain of thought. So eventually it comes back and says that since o = 3 exceeds the budget, the only feasible solution is o = 2 and a = 4. So it's worked through the problem; it's even done the math there (2.857) and noted that you can't buy a fraction of an orange. It's done a really good job, especially for a 14-billion-parameter model. Of course, that doesn't make it as good as Claude 3.5 Sonnet or GPT-4o, but I think this is why it's doing so well in the math domain: it's working through these chain-of-thought-style problems.
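The model's feasibility reasoning on this word problem is easy to check with a quick brute force. A minimal sketch, using the prices and budget stated in the problem:

```python
# Brute-force check of the word problem: apples $1.00, oranges $1.50,
# budget $10.00, buying twice as many apples as oranges.
best = None
for oranges in range(0, 11):
    apples = 2 * oranges  # the "twice as many apples" constraint
    cost = apples * 1.00 + oranges * 1.50
    if cost <= 10.00:
        best = (apples, oranges)

print(best)  # (4, 2): 4 apples and 2 oranges for $7.00
```

At 3 oranges the cost would be $10.50, which is exactly the infeasibility the model spotted before settling on o = 2, a = 4.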
Again, I can give it a few more queries. If I give it a "you're given the equation..., find all the integer solutions" type question, it does the calculations, and again you can see it working through all the terms and solving the equation using chain of thought, and finally it comes back with the answer, and it is absolutely the right answer. For a 14-billion-parameter model this is absolutely insane; it's doing a great job. I can give it some pretty complicated problems as well. In this case I'm going to ask it what I call the 650-gates problem, which is like the classic 100-gates problem except I've been lazy and made it 650 gates: there's a garden with 650 gates, all closed; you make 650 passes through the garden, and on each pass you toggle the gates as you go through, working all the way up to 650. And then what I've done is ask it to come back with the first 10 open gates and the last 10 open gates. In this particular case it's done a fabulous job; it's actually got all the right answers. It's got the first 10 gates. The only problem I'd point out with the last 10 is that it missed the final gate, gate 625: it's got it over here, where it says it will include 625, but when it summarized at the bottom it missed it from the last 10. I'm not really going to worry about that too much; the fact is it did solve the problem, and it's doing a great job. This is a really good math model, though of course both GPT-4o and Claude solve this as well. The problem, I would say, is that if I give it a straight arithmetic question like 254 multiplied by 752 minus 3 and so on, you'll see it actually messes it up and doesn't give the right answer.
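The gates puzzle mentioned above is the classic toggling problem: a gate ends up open when it's toggled an odd number of times, i.e. when its number has an odd number of divisors, i.e. when it's a perfect square. It can be simulated directly:

```python
# Simulate the 650-gates problem: on pass p, every p-th gate is toggled.
n = 650
open_state = [False] * (n + 1)  # index 0 unused
for p in range(1, n + 1):
    for gate in range(p, n + 1, p):
        open_state[gate] = not open_state[gate]

open_gates = [g for g in range(1, n + 1) if open_state[g]]
print(open_gates[:10])   # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
print(open_gates[-10:])  # [256, 289, 324, 361, 400, 441, 484, 529, 576, 625]
```

The open gates are exactly the perfect squares up to 650, ending at 25² = 625, the gate the model dropped from its summary.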
As you can see, it says the final result is 190,805, which is in fact the wrong answer; the answer according to Claude is 191,005, which is correct. However, the good news for this model is that if I ask the exact same question of GPT-4o, in this case saying "do not use tools", GPT-4o messes up the question as well. If I say "use tools" (I'm being very shouty with my caps lock there), you see the right answer is 191,005, which of course Claude got. So, if I'm honest, Claude is still the better model; however, for a 14-billion-parameter model it's done a really, really good job. So, as you can see, it is actually really good at math, and it's probably pretty close to what Claude 3.5 Sonnet and GPT-4o are doing in math, but I'd say Claude 3.5 Sonnet is still beating it. What is really good is the chain of thought it generates. When you combine that later with inference-time compute, which I think is the real solution for doing math problems if you set aside using tools and function calling for a second, then I think they're really setting themselves up for this kind of chain-of-thought mode. And you can actually see this in the model. Let's do some typical puzzles that I like. In this case I'm going to do my Sudoku grid problem: I'm going to ask it, "is this a valid Sudoku grid?" Now, it's going to generate the right chains of thought, but it may or may not actually get it right. If we look at this grid for a second, you can see the number seven appears twice in the final column, so when we ask the question it should pick up that this is not a valid grid. The first thing, back to my point, is that the chain of thought it's doing here is really good; this is what I'd expect these models to do. I'd expect it to check row by row,
column by column, and grid by grid to check whether the answer is correct. So far it reckons all the rows are correct, but it should now pick up that the final column is invalid, and it has also picked up that column 4 is invalid because it's got two twos. We can check that: there we go, the two 2s there and the invalid 7. So it's picked up the columns, and now it's doing the grids. It's doing a really good job here; this is how you should approach this. Again, if I go into something like ChatGPT, we'll use GPT-4o in this case: GPT-4o has actually done a pretty bad job at this as well, because it's skipped the row-by-row validation and the column validation and gone straight to the grids. It really hasn't spent the time looking at this, which of course the o1 models would do. And again, this is really setting up for inference-time compute: if you take what's happening with this tiny 14-billion-parameter model and put inference-time compute in front of it, you're going to get great results. So in some regards it's doing a better job than GPT-4o, though it's obviously not going to do a better job than anything using inference-time compute, such as o1. Now I'm going to give it a slightly harder question: this time I'm going to give it this grid here, asking the same question, "is this a valid Sudoku grid?" The key thing I want you to notice about this grid is that the rows and the columns are generally okay; the issue is within the sub-grids. If you look at this center grid, you see there are two fours. So let's see how the Phi-4 model does. Again, it goes through the exact same sequence as before, checking row by row. Okay, so it's come back now, and it's saying that since the bottom-middle sub-grid
contains a duplicate nine, the grid is not a valid Sudoku solution. But it's missed the center grid; it's missed the fact that the four and the four there are invalid. It has picked up at the bottom that there's a duplicate nine, but actually, if we compare with the grid, there isn't a duplicate nine there; it's a missing nine. So it's got the right answer, and it's doing the right chain of thought, but it hasn't quite got to the right outcome. If I go into the DeepSeek R1 model for a second and ask the same question, you can see R1 follows a similar sort of path: it messes up that central box as well, but it does pick up the double nine in the bottom box, and eventually it says the grid is not valid due to duplicate nines in box eight; but it has also missed that middle box. Now, if I bring this into o1 for a second, as you can imagine, o1 isn't going to mess this up at all, because it's a really, really good reasoning model. And there we go: o1 has come back with the answer; it's picked up the error in the middle box, and, to be honest, it's then given up looking at the other ones; it's just focused on this middle box and summarized. But it picks it up where the other models don't. The key to this is the chain of thought: if you took that Phi-4 model and combined it with inference-time compute, it wouldn't make the same mistake, because it would have the ability to reflect on and check the answer in the same way the DeepSeek R1 model did. I don't love the DeepSeek R1 way of doing inference, though; it generates a lot of tokens for my liking. I think the way Phi-4 generates its chains of thought is much cleaner and much more in line with what o1 is actually doing. So let's try another question.
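The row-by-row, column-by-column, sub-grid-by-sub-grid validation these models are all attempting is entirely mechanical. A minimal sketch, using 0 for an empty cell:

```python
def valid_sudoku(grid):
    """Return True if no row, column, or 3x3 sub-grid of the
    9x9 grid contains a duplicate digit (0 marks an empty cell)."""
    def no_duplicates(cells):
        digits = [c for c in cells if c != 0]
        return len(digits) == len(set(digits))

    rows = [list(r) for r in grid]
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [[grid[3 * br + r][3 * bc + c] for r in range(3) for c in range(3)]
             for br in range(3) for bc in range(3)]
    return all(no_duplicates(group) for group in rows + cols + boxes)
```

A grid with two 7s in the same column, like the first example, fails the column check; a duplicate 4 inside the center box, like the second example, fails the box check even when every row and column is clean.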
A little bit of tic-tac-toe this time: I'm going to ask it what X's best next move is. I've given it a game board, and you see the board there: X and X, and O has just blocked X at this point in time, and you see the other O over there. So what is X going to do? What you'd expect X to do is the fork: you really want X to place a piece here, because it sets itself up for a win here or a win here; the other alternative is to place an X here, which sets up a win here or here. Either of those answers would work really well. If I look at GPT-4o, it's saying the best position is position three: it's got the right answer, so it's done a pretty good job. Let's see how Phi-4 does. Again, as you look through this, what's really nice is that it's actually working through its chain of thought: it's looking to see whether it can get a winning move and whether it needs to block, again setting itself up for inference-time compute. And there you go: considering the options, the best strategic move for player X is to choose position three. Now let me compare this to some other models for a second. If we look at Claude, you see Claude has also got it right: the move should be made at position three, so well done, Claude. If I ask DeepSeek R1, again this is an inference-time-compute model, and it generates an awful lot of tokens, a lot of tokens, and it actually comes back with the wrong answer: it says the best position for X is position seven, which it clearly isn't in this case, because you're not giving yourself the opportunity to create the fork. So it's generated a lot of tokens but come back with the wrong answer, and the reason is that although it's using inference-time compute and generating a lot of tokens, it's not been trained on good chain of thought for
solving the game of tic-tac-toe. Whereas if I look at the Phi-4 model, you can see it's following a nice, clear logic, evaluating each position. The o1 models do much, much better, but this is setting itself up for a good future. Now, if I go into GPT-4o mini for a second and ask what the best next move is, you can see it gets it wrong: it says position two. Of course, if I look at o1 mini, o1 mini has absolutely got it right, and again, if I ask GPT-4o, as you saw a second ago, it comes back with the right answer, position three. So, you know, Phi-4 is really setting itself up well; it's doing a good job on these puzzles.
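For a board this small, the "best next move" question has an exact answer, so the models' reasoning can be checked against a brute-force minimax search. A minimal sketch; note the board string below is illustrative (the exact position from the video isn't fully shown), and squares are indexed 0 to 8 here where the video numbers them 1 to 9:

```python
# All eight winning lines of a 3x3 board (cells indexed 0..8).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for i, j, k in LINES:
        if board[i] != ' ' and board[i] == board[j] == board[k]:
            return board[i]
    return None

def best_move(board, player):
    """Exhaustive minimax from X's point of view (+1 X wins, 0 draw,
    -1 O wins). board is a 9-character string; returns (score, move)."""
    w = winner(board)
    if w is not None:
        return (1 if w == 'X' else -1), None
    if ' ' not in board:
        return 0, None
    opponent = 'O' if player == 'X' else 'X'
    results = []
    for i, cell in enumerate(board):
        if cell == ' ':
            score, _ = best_move(board[:i] + player + board[i + 1:], opponent)
            results.append((score, i))
    return max(results) if player == 'X' else min(results)

# X at 0 and 1, O at 3 and 4: X's only winning move is square 2.
print(best_move('XX OO    ', 'X'))  # (1, 2)
```

From an empty board the same search reports a draw under perfect play, which is the benchmark any of these models should be measured against.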
Now, if I want, I can compare it against Llama 3.3, the 70-billion-parameter model, running on my machine. I'll ask Llama 3.3 the same question; let's see what it comes back with. There you go: Llama 3.3 has come back, and Llama 3.3 is a much slower model, but it's not trained in its chain of thought to be able to handle games like tic-tac-toe; it's not set up for inference-time compute in the same way, and you can see Llama 3.3
has messed it up: it says place their mark in position two. All right, so Phi-4 has done a really good job at chain of thought, and you're thinking to yourself, fantastic, that's exactly what I need for agents. And you'd be right; that is where you want to go. But there's a problem, which we're going to see in a second. I'm going to run my MCP CLI, which uses the Model Context Protocol created by Anthropic; I created a little tool that allows you to do function calling via Ollama (I did a video on this). In this case I'm going to run it with Llama 3.2, and then we'll compare to see what happens with Phi-4. So I'm going to enter chat mode and say "select top 10 products ordered by price". This is Llama 3.2, and you see it flails around a little bit, messing up the errors, and it kind of gets it wrong. I can lead it to the right answers: I can say "what are the tables?", and you can see it now recognizes it needs to do a list-tables call, and it gets the answer. Then I'll say "describe the products table", and then I can say "select top 10 products ordered by price", because Llama 3.2
now knows everything that's in the table and can make the query. You can see the tool call, and eventually it comes back with the answer. Now, if I quit out of this for a second and run the exact same thing with Llama 3.3, Llama 3.3 is going to do a really good job. I'll say "select top 10 products ordered by price".
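I'm not reproducing the actual MCP CLI code here, but the loop such a CLI runs comes down to the model emitting a structured tool call and the client dispatching it to a real function. A rough, hypothetical sketch, where the tool names and argument schemas are made up for illustration and are not the CLI's real ones:

```python
# Hypothetical sketch of a tool-dispatch step in an MCP-style chat CLI.
# Tool names and arguments are illustrative, not the real MCP schema.

def list_tables():
    return ["products", "orders"]

def describe_table(name):
    return {"products": ["id", "name", "price"]}.get(name, [])

TOOLS = {"list_tables": list_tables, "describe_table": describe_table}

def dispatch(tool_call):
    """Execute one structured tool call emitted by the model."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call.get("arguments", {}))

# A model that supports tool calling emits structured calls like these;
# a model without tool support (as Phi-4 on Ollama at recording time)
# errors out before this step is ever reached.
print(dispatch({"name": "list_tables"}))
print(dispatch({"name": "describe_table", "arguments": {"name": "products"}}))
```

The quality difference between the models in this section is entirely about whether, and how reliably, they emit those structured calls; the dispatch side is the easy part.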
Because Llama 3.3 is a really smart model (unfortunately a slow one, being a 70-billion-parameter model), you'll see it does the tool calls it needs to calculate this, and it comes back with the answer, although it takes an age. So this is the use case for Phi-4: it's really good at chain of thought, right, so agents. Here we go: I'm going to run my exact same tool, put Phi-4 in there, enter chat, and say "select top 10 products ordered by price", and we're going to find the best use case for this... and there it is: an Ollama error, "does not support tools". And that is the problem: we have a model that is great at chain of thought, but it's not actually supporting tool calling at this point in time. I'm just frustrated, because that's the use case I want it for, but it's not available at the moment with Ollama, because the model isn't supporting tool calls. Maybe they'll release a fine-tuned version at some point that's able to do this, but it just frustrates me. So let's do another test. This time I want to check its ability to role-play, so in this particular case I'm going to run it with a persona; the persona is going to be an internet troll, and then I'm going to say "hi" and see what happens. "Oh, hi there, finally decided to grace this with your presence; I was starting to think you'd been busy scrolling." So it's not bad. We'll say "tell me about cheese" and see what happens. The answers are okay here, but they're not very troll-like, and I think this is one of the things I'd point out: it's a 14-billion-parameter model, and it's obviously trying to be clean, because it's trying to be usable for enterprise. But if I ask the same questions, this time let's just use Llama 3.2
for a second, and we'll do the same over here. Now I say "hi", and there you go: "another brainless drone stumbling into my realm, thinking they can handle the truth." We'll say the same question, "tell me a story about cheese": "you want to hear a story about cheese? How quaint. Let me regale you with an epic tale of fromage." Great. Again, with the Llama 3 models, even the 3.2,
which is a much smaller model (it's like an 8-billion-parameter model; I'm not even running 3.3 at this point), the personality for role-play from the Llama models is so much greater than from the Phi models. So that's something I'd think about: although the Phi models can role-play, they don't have the same level of personality. And you can see this a little further. I did a test earlier today where I role-played with a pirate. I said "you're a pirate, answer all questions as a pirate; if you understand, say so"; I didn't set it as a system prompt, I just did it as a user prompt. Then I said "tell me a story about cheese", and there it does the "ahoy there, matey" and talks about his trusty crew, Gorgonzola George; the story is okay. Then I asked "what's Kubernetes?", and it gave a little bit of an explanation about computers and servers and all this. Now let's compare this to the Llama 3 answer. Again, Llama 3.3 is a much bigger model, a 70-billion-parameter model, but look at the difference here. I said "you're a pirate, answer all questions... tell me a story about cheese", and look at this as a difference: "it were a dark and stormy night on the high seas." You feel the storytelling in this: "we be sailing the good ship Monster's Revenge, we've been at sea for weeks, plundering and pillaging", etc., "but I, Captain Billy Blackbeak", etc., "and me trusty first mate, Barnaby Blackheart... I be thinking a bit of Limburger." This is so good. The storytelling capabilities of Llama 3.3
are much, much better. In fact, this is something to be aware of, and it's really a shortcoming of benchmarks: two models might score the same, but when you ask a bigger model, or a model with more personality, to do a task like storytelling, it's going to do a better job. And you can really see this; this bit is great: when I say "what's Kubernetes?", you see "yarr, landlubber, ye want to know about Kubernetes... deploy yer containers on a cluster o' ships... scale yer containers to meet demand, add or remove crew members." You see what it's doing there: it's holding its personality while comparing containers to what a pirate crew is like. So again, I think Llama 3.3 is a much better model; it's a bigger model, and I'd expect it to be, but to be fair to Phi-4, it's done a good job. Now, probably the big question everybody's going to ask is about coding. It's actually a pretty decent coding model for simple coding stuff, and they're not claiming in the paper that it does coding very well, but it's nothing compared to Claude 3.5 Sonnet and GPT-4, as you'd expect. In this case I'm going to ask it to create an app in React that gets the latest time from an internet time service, updates the time every second with the real time from the internet, keeps it in sync, does not use external dependencies, and creates an AI-generated donkey using SVG, where the donkey should blink every time the second changes. While it's generating that, let's ask Claude the exact same question as a little comparison. Now look at what Claude's generated: we can argue whether that's a good donkey or not (it looks more like a Pikachu that's been run through the mud or something like that), but you see it's blinking in time with the seconds, and it's generated a donkey. I think Claude is the best at this; in one of my previous
videos it just drew a beautiful donkey. Let's compare that to how Phi-4 has actually done. I'm going to pick up this component, paste it in here, and do a bun run, then a bun run start (I'll run this on another port), and let's see what it generated. It's not bad, actually: you know, a real-time clock, and that is not bad at coding. Is it better than Claude's? We can sit and argue about this; maybe Claude's is slightly better. But it's actually done a pretty good job; it's not a bad coding model. You're going to do well with JavaScript and with Python on it; I think it's pretty decent, and they're not claiming this is the best model for coding. But again, let's compare it on something a little more complicated. I'm going to say, here is some code for some embeddings, and I want it to fix it up. So let's ask this question. Now, there are a couple of things I want you to notice here. I've said "here's some code that extracts some image embeddings", and I've given it the code, from Transformers and so on. Notice I've asked it to use jinaai/jina-clip-v1, and already it's gone off and tried to use OpenAI's clip-vit-base model instead. So it's not actually followed my instructions, and if I run that code it's not going to run, so I'm not going to bother. Even GPT-4o mini does a decent job of this: if I copy that code, come back to VS Code, paste it in, and do a python test.py, you see it comes back with a cosine similarity of 0.
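The cosine-similarity number that test script prints is the standard way to compare two embedding vectors. A minimal, dependency-free sketch of how it's computed:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors:
    1.0 for identical direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

The same formula applies whether the vectors come from jina-clip-v1 or any other embedding model; only the vectors themselves change.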