We have started seeing some independent tests of DeepSeek R1, and it looks pretty strong. Even on my own tests, this seems to be one of the best open-weight models available, and in some cases it is even better than o1. So we're going to look at a few tests covering coding and reasoning capabilities, especially whether it can understand tricky questions from the misguided attention set, and then we're going to address some of the controversy behind this model. We now have independent tests such as LiveBench, and here, overall, when it comes to coding, mathematics, and reasoning capabilities, DeepSeek R1 is just behind OpenAI's o1 model, and the weights are openly available if you have the hardware to run it; even the API costs almost 50 times less than o1. Similarly, on the Aider polyglot benchmark it's just behind the o1 model: it scores about 57% on correctly completed tasks, and when it comes to editing it is actually better than o1, at about 97% of tasks. In this video, I'm going to run two different sets of tasks. The first one is going to be coding: we're going to test it on a couple of coding problems. Then I want to run a few reasoning tasks, because this model is supposed to be really good at reasoning. Spoiler alert: this is probably one of the best models I have seen. I'm testing this in their official web UI, which you can access at chat.deepseek.com. Okay, my first prompt is a relatively simple one: I wanted to create a
web page with a single button that says "click me," and it's supposed to show random jokes from a list of jokes; whenever we click that button, it's supposed to change the background and also show a random animation. Here's the internal thought process, and it's very human: if you look at the output, it says something like, "okay, let's see, the user wants a single HTML page with specific features." If you read through this, especially the random animation part, it says "the user wants a different animation each time." It does seem a lot more human than some of the other LLMs that I have tested. So here is the code that it generated, and I think there are some instructions here as well, so let's copy this. I'm going to paste it into this online code editor and click run. We have a background with a button that says "click me," and if I click it, it does change the color and it randomly picks a joke, which probably repeats every time I run the test,
and it's also showing the animation, so a really good start so far. My second test is a lot more detailed, and this is the kind of work that I would actually use an LLM for. I wanted it to create a web app that would take a text input from the user and then use an external API to generate an image. In this case, I wanted it to use the Replicate API, so I provided that documentation and then simply asked it to provide detailed documentation on how to run the app. One more thing which I usually ask is for the LLM to create the structure of the project and then create a bash command that I can directly use to create that structure. Here's the thinking process, and it is a lot more robust than some of the other LLMs that I have used. Here's the structure, and after that it gave me this bash command, so I just used that bash command to create the project structure. It also provided the Python code plus the requirements.txt, the front end, and some styling. So these are the different files that I need to create. Here I ran that command and it created the file structure for me; then I simply pasted the code for each of the files. When I ran the app for the first time, it gave me this error, so all I had to do was copy it; actually, I took a screenshot, provided that image, and said I'm getting this error. Basically, there was an issue with serializing the output, so now it's using the base64 package, and it gave me the corresponding instructions on how to fix it. With all the code fixed, I just had to run this command, and the app is running here, so let's try to access it. Now we can provide our instructions; let's say I want to create an image of a llama with sunglasses. When I click generate, it did generate the image, so it's working correctly. Let's try to regenerate; that seems to be working. And if we try to download the file, the download also works, so this is pretty neat.
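To give a sense of what this kind of backend looks like, here is a minimal sketch in the spirit of the generated app, assuming Flask and the official replicate Python client; the model slug, route name, and response shape are placeholders of my own, not the exact code from the video, and the base64 step mirrors the serialization fix described above.

```python
# Minimal sketch of a text-to-image backend (assumptions: Flask, the replicate
# Python client, and a placeholder model slug). Requires REPLICATE_API_TOKEN
# to be set in the environment.
import base64

import replicate
from flask import Flask, jsonify, request

app = Flask(__name__)

MODEL = "black-forest-labs/flux-schnell"  # placeholder; the actual app may differ

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json.get("prompt", "")
    # Call the Replicate API with the user's prompt.
    output = replicate.run(MODEL, input={"prompt": prompt})
    # Recent versions of the replicate client return file-like outputs; the raw
    # bytes are not JSON-serializable (the same kind of serialization error hit
    # in the video), so base64-encode the image before returning it.
    image_bytes = output[0].read()
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return jsonify({"image": encoded})

if __name__ == "__main__":
    app.run(debug=True, port=5000)
```

The front end would then POST a prompt to /generate and render the returned base64 string as a data URL in an img tag.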
Okay, next I asked it to create a detailed tutorial to visually explain the Pythagorean theorem, and I specifically asked it to use Manim. In this case I had a little back and forth because the Manim package was not correctly set up, but it was able to walk me through all the instructions and provide solutions to every error that I was facing. So here's the final code that it provided, and here's the final output that it created using that code: a visual explanation of the Pythagorean theorem.
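For reference, a Manim scene for this kind of visualization typically looks something like the sketch below; this is my own minimal example, assuming the Manim Community edition (with LaTeX available for the MathTex label), not the exact code the model generated.

```python
# Minimal Manim sketch of a Pythagorean-theorem visualization (illustrative only).
from manim import *


class PythagoreanTheorem(Scene):
    def construct(self):
        # Right triangle with legs a=3 and b=4, so the hypotenuse is c=5.
        triangle = Polygon(ORIGIN, RIGHT * 3, RIGHT * 3 + UP * 4, color=WHITE)
        label = MathTex("a^2 + b^2 = c^2").to_edge(UP)
        self.play(Create(triangle))
        self.play(Write(label))
        # Squares on the two legs: areas 9 and 16, which sum to 25 = c^2.
        square_a = Square(side_length=3, color=BLUE).next_to(triangle, DOWN, buff=0)
        square_b = Square(side_length=4, color=GREEN).next_to(triangle, RIGHT, buff=0)
        self.play(Create(square_a), Create(square_b))
        self.wait()
```

You would render a scene like this with something along the lines of manim -pql pythagoras.py PythagoreanTheorem.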
The representation seems to be pretty accurate and the visualization is pretty nice, so that's really good. It also added really good styling, and there are some nice details in the text. Overall, I think it did a really good job of explaining the Pythagorean theorem. Okay, next we're going to test it on the misguided attention repo. These are famous questions or paradoxes, but the author has made small changes, and the goal is to test the ability of these reasoning models to pick up those small changes. In most cases, many LLMs mistakenly recognize the unmodified problem because of how frequently it occurs in their training data; if a model is able to reason, it should focus on the language of the prompt that is presented rather than on the examples it has seen in training data. My first prompt is a modified version of the trolley problem. In this case, the main difference is that the five people on the track are already dead, so let's see if it can actually pay attention to the part where the people are
already dead. This is the real-time speed, and one thing you notice right away is that the internal monologue is very humanlike, very unlike other LLMs that I have tested. Let's quickly look at the internal thought process, and then we'll look at the output. It says, "okay, I need to figure out whether I should pull the lever in the trolley problem. Let me start by recalling the scenario: there is a runaway trolley heading towards five dead people." This is the first time I have actually seen an LLM recognize that they are already dead. "If I do nothing, it will continue on its current path and hit them. But wait, the five people are already dead; that's different from the usual trolley problem where they are alive. Then there's another track with one living person, and the question is whether I should divert the trolley to the other track or not." Then it says, "hmm, the first thing that comes to mind is the classic ethical dilemma where you have to choose between saving more lives by sacrificing one, but here the twist is that the people are already dead, so maybe the initial thought is that saving the five isn't about preserving life but perhaps respecting their bodies. Is there something else going on here?" It really has an internal monologue with itself, because the scenario is very different, and I'm really surprised that it was able to pick this up, because even o1, based on my initial testing, wasn't able to notice that the people are already dead. Really smart. Here at the end it says, "but I should check if there's another interpretation: maybe 'dead people' is a typo and should be 'people,' but the original question says 'dead people,' so assuming that's correct, the key difference is that the five people are deceased; therefore the moral dilemma isn't about numbers but about the value of living versus dead." The final answer that it comes up with is this: "the ethical dilemma presented here is a variation of the classic trolley problem but with a critical twist: the five individuals on the main track are already dead, while the alternative track holds one person. Here's the analysis." It comes up with different points of view, but in each case, since it figured out that the people are already deceased, the conclusion is exactly the same. At the end it says the ethical choice is not to pull the lever: diverting the trolley would unjustifiably sacrifice a living person for no net gain in lives saved, while violating both utilitarian and deontological principles. This is the first model that has actually been able to pick this up, so I'm really impressed so far. Now, to make sure that it doesn't catch on to the pattern, for every new prompt I'm going to be starting a
completely new chat session. Okay, next we're going to ask a modified version of the Monty Hall problem. In this case there's a huge twist, and the final output should be a 50% probability whether you decide to switch doors or not. Let's see how it tackles this one. Again, the reasoning process is really fast, but one thing I have noticed is that the thinking process is a lot more verbose compared to o1 or Gemini 2.0 Flash. It says, "okay, so I am on a game show with three doors: one has a car, the other two have goats. I picked door number one initially; then Monty Hall opens door number one and shows a goat. Wait, hold on, that's not how the usual Monty Hall problem goes. Normally you pick a door and Monty opens a different door that has a goat. Maybe there's a mistake in the problem description; let me read it again." So it reads the problem again, which is pretty smart. Then it says, "wait a second, in the standard problem, after you pick a door, Monty opens a different door that has a goat, not the one you picked. Here, according to the problem, Monty opens door number one, which was my original choice, revealing the goat; then the remaining doors are two and three, I pick door number three, and Monty offers me the chance to switch doors. This seems different from the classical problem." So again, very smart in the beginning, because it's able to pick up how these prompts differ from the original prompts it has seen in training data, and that is pretty impressive. Now it goes through a whole bunch of different scenarios: the first one is basically the modified version, and then it even considers what would happen if the problem was not modified, so pretty smart in that sense. In the end, the final output is: in this modified Monty Hall problem, after initially choosing door number one, which Monty reveals has a goat, the car is equally likely to be behind door number two or door number three, so switching to door number two or sticking with door number three gives you the same probability of winning the car, which is 50% each; therefore, there is no statistical advantage to sticking or switching in this scenario.
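That 50/50 result is easy to sanity-check with a quick simulation. The sketch below is my own (not from the video): it conditions on the setup described in the prompt, where Monty opens the contestant's original door, door 1, and reveals a goat before the contestant ends up holding door 3.

```python
# Monte Carlo check of the modified Monty Hall problem: Monty opens the
# contestant's original door (door 1) and reveals a goat; the contestant then
# holds door 3 and may switch to door 2.
import random

def trial():
    car = random.choice([1, 2, 3])
    if car == 1:
        return None  # Monty showed a goat behind door 1, so these worlds never happen
    return {"stick_wins": car == 3, "switch_wins": car == 2}

results = [r for r in (trial() for _ in range(100_000)) if r is not None]
stick = sum(r["stick_wins"] for r in results) / len(results)
switch = sum(r["switch_wins"] for r in results) / len(results)
print(f"stick with door 3: {stick:.3f}, switch to door 2: {switch:.3f}")  # both ~0.5
```

Unlike the classic version, the host's reveal here gives you no information that favors one remaining door over the other, which is why both probabilities come out around 0.5.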
Extremely smart. This is probably one of the few models that is able to give the correct answer for this modified version of the Monty Hall problem. Next, we're going to look at a modified version of the Schrödinger's cat paradox. In this case, the only difference is that the cat is already dead; let's see what happens. Again, a very verbose thought process and extremely fast, but let's look at the beginning. It says, "so I came across this physics problem with a cat in a box with some radioactive isotopes, poison, and a radiation detector; it sounds familiar, like the Schrödinger's cat thought experiment." Then it says, "let me try to work through it step by step," and that step-by-step approach seems to come from the chain-of-thought reasoning that is embedded in the training. It says, "let me restate the problem to make sure I understand: there's a cat in the box along with a nuclear isotope, a vial of poison, and a radiation detector. If the detector senses radiation, it releases the poison, which would kill the cat. The box is sealed and then opened one day later. The question is, what's the probability of the cat being alive when we open the box?" It seems like in this case it didn't pick up on the fact that the cat is already dead, so it goes through the whole quantum mechanics calculation and tries to figure out what the probability would be. Let's see what the final answer is. It says, "the problem involves a cat in a box with a nuclear isotope, a vial of poison, and a radiation detector; if the detector senses radiation, it releases the poison, killing the cat. The box is opened one day later. Key elements: the cat's survival depends on whether the nuclear isotope decays, and radioactive decay is a probabilistic quantum mechanical process." So it goes through the classical calculation and concludes that the probability the cat is alive is 50% if you open the box one day later.
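For context, the survival calculation the model appears to be running is the standard one below; the one-day half-life is an assumption on my part (the transcript doesn't say which half-life the model used), chosen because it is the value that makes a 50% answer come out.

```latex
% Survival probability for the unmodified thought experiment, assuming the cat
% starts out alive and the isotope has half-life T_{1/2}:
P(\text{no decay in } t) = e^{-\lambda t} = \left(\tfrac{1}{2}\right)^{t / T_{1/2}},
\qquad
P(\text{cat alive after 1 day}) = \left(\tfrac{1}{2}\right)^{1\,\text{day} / T_{1/2}} = 0.5
\ \text{ if } T_{1/2} = 1\,\text{day}.
% With the modified prompt (cat already dead), the answer is simply 0, independent of any decay.
```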
Okay, so let's see what happens if I ask it, "does the initial status of the cat have any impact on the conclusions?" Let's see how it responds; keep in mind that at this point it had basically reverted to its training data. Here's the internal thought process. It says, "okay, so the user is asking if the initial status of the cat affects the conclusion in the Schrödinger's cat scenario; let's break it down." I think even that question, that nudge, isn't enough to get it to pay attention to the fact that the cat is dead at the beginning. Here it does note, "if the cat were already dead when placed in the box, the probability of it being alive when the box is opened would trivially be 0%, regardless of the isotope decay." So it does figure that out, but I don't think it was paying enough attention to notice on its own that the cat is already dead. Okay, here's another one: a farmer is on one side of the river with a wolf, a goat, and a cabbage, and his goal is to simply transfer the goat to the other side of the river; we don't really care about the status of the wolf and the cabbage. Let's see if it gets confused by its training data or is going to be able to pay enough attention to the details of this specific problem. It took quite a while, and I don't think it was paying attention to the question in this case. This is all the reasoning, so it used a lot of tokens, but then it came up with an overly complicated procedure for getting the goat to the other side. The sequence of steps is: take the goat to the right side. Here is where it should simply stop, but then it says return alone to the left side, then take
the wolf, bring back the goat, and so on and so forth. So it seems like it's simply following its training data rather than the question in this case. Okay, let's test it on a much simpler problem, but one that also confuses a lot of frontier models: we say that we have a 6-liter jug and a 12-liter jug and we want to measure exactly 6 liters. It seems like, again, it is going through an unnecessarily long reasoning loop. It took a while, but the answer is right, and this is what I would expect. It says: to measure exactly 6 liters using a 6-liter jug and a 12-liter jug, you have two straightforward methods. The first one is to fill the 6-liter jug completely, which gives you exactly 6 liters. The second one is to fill the 12-liter jug and then pour water from it into the 6-liter jug until the 6-liter jug is completely full, which leaves exactly 6 liters in the 12-liter jug.
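As a trivial sanity check (my own sketch, not part of the video), the second method boils down to a single pour:

```python
# Fill the 12-liter jug, then pour into the empty 6-liter jug until it is full:
# exactly 6 liters remain in the big jug (and 6 liters sit in the small one).
big, small = 12, 0
SMALL_CAPACITY = 6
poured = min(big, SMALL_CAPACITY - small)
big, small = big - poured, small + poured
print(big, small)  # -> 6 6
```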
This is pretty great, because most of the other frontier models will just go through a whole bunch of reasoning without clearly answering the question, and the question is pretty straightforward. Overall, I think it's a very impressive model; I haven't seen this type of performance from other reasoning models, so I'm really impressed with some of the answers it can come up with. Okay, in this last part of the video I want to address the issue of censorship when it comes to models from China. For some reason, whenever a new model is released from China, people bring up this issue of censorship. For example, if I were to ask DeepSeek R1, or any of the Chinese models for that matter, "tell me about Tiananmen Square," here is what happens: it starts generating a response, and then all of a sudden it seems like there's a guardrail on top of it which says, "sorry, I'm not sure how to approach this type of question yet; let's chat about math, coding, and logic problems instead." It's a known issue; however, it's not limited to Chinese models. Each of the model creators, like OpenAI, Anthropic, and xAI with Grok, has their own political biases, and if you ask them about certain topics or certain historical facts, they will refuse to respond. If you're using an LLM to test political affiliation or historical facts, you are doing it wrong. I don't think they are capable of giving you any political opinion other than the one the creators of the models have instilled in them, and you can't even rely on them for checking historical facts, because everybody has their own version of history and the models basically inherit those. Now, when it comes to open-weight models, the beauty is that, even from this example, it seems like for this topic R1 is willing to generate a response, but there is a guardrail on top of it. Since the weights are openly available, you could potentially run this model yourself and get a response out of it; you can't do that with other closed-source models. Anyway, I wanted to address this because I know there are going to be comments about this model being from China and being highly censored, so I just wanted to address that. Other than that, I think this is probably one of the most impressive models that I have seen, especially on coding as well as reasoning tasks, so do give it a try. I'm also going to test the distilled versions, the smaller models from 32 billion all the way up to 70 billion parameters, so if you're interested in that, make sure to subscribe to the channel. Thanks for watching, and as always, see you in the next one.