This new large language model has taken the tech world by absolute storm and represents a big breakthrough in the AI research community. Last Sunday, while TikTok was banned for 12 hours, an AI research team from China released a new large language model called DeepSeek-R1. As you can see on the screen, DeepSeek-R1's benchmarks show that it performs at a similar level to OpenAI's o1 model on reasoning problems like math, coding, and scientific reasoning. In this video I'll talk about the three main takeaways from their paper: how they use chain of thought to have the model self-evaluate its performance, how they use pure reinforcement learning to have the model guide itself, and how they use model distillation to make DeepSeek and other LLMs more accessible to everyone.

Chain of thought is a very simple but effective prompt engineering technique where we pretty much ask the model to think out loud: we add to our prompt that we want the model to explain its reasoning step by step. That way, if the model makes any mistakes, we can easily pinpoint where in its reasoning it went off track and reprompt the model so it doesn't make the same mistake again. Here is an example from the paper: if you give the model a question like this math problem, you can see that in its response it actually reasons through it and shows the steps it took to get to the solution; it shows its work. You can see in red it says "wait, wait", there's an aha moment, as well as "let's re-evaluate this step by step", and in doing so the model ends up with a more accurate response than if it just gave the answer by itself without chain-of-thought reasoning.
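To make that concrete, here's a minimal sketch of what a chain-of-thought prompt can look like in code. The `ask_model` function is just a placeholder for whatever LLM API you're using, and the prompt wording is an illustration, not the exact prompt from the paper.

```python
# Minimal chain-of-thought prompting sketch.
# `ask_model` is a hypothetical stand-in for any LLM chat/completion API.

def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("wire this up to your model of choice")

question = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

# Plain prompt: the model may just blurt out an answer.
plain_prompt = question

# Chain-of-thought prompt: explicitly ask the model to reason step by step,
# so every intermediate step is visible and any mistake is easy to pinpoint.
cot_prompt = (
    f"{question}\n\n"
    "Please think out loud and explain your reasoning step by step "
    "before giving the final answer."
)

# answer = ask_model(cot_prompt)
# print(answer)
```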
The way DeepSeek uses reinforcement learning is a little different from how most AI models are trained: we don't give it the question and the answer, we kind of let it learn on its own. It's exactly like how a baby learns to walk for the first time. If you've ever watched a baby, it's actually pretty funny: they stumble around the environment and maybe hold on to things as they figure out how to walk, and in doing so they're learning how to move and position their joints so that they don't fall. In the same way, reinforcement learning lets us train a model by optimizing its policy, AKA how the model behaves, so as to maximize a reward. As it explores its environment over time, it learns which policies maximize the reward and then it tends to pick those. For example, if you're solving an equation, there might be two or three different ways to solve it, but one of them is much shorter than the others and thus earns a much higher reward. Reinforcement learning is also exactly how most robots learn to walk and how Tesla's self-driving cars learn to drive through a city.

If we go to the paper and look at this graph, we can see how DeepSeek-R1's accuracy on questions improves as we train it over time using reinforcement learning. Instead of telling the model what the correct answer to a question is, since that kind of data is pretty expensive to obtain, we let it figure things out on its own while measuring how accurate the model is. You can see that while OpenAI's o1 score is static, DeepSeek-R1 eventually outperforms it, and if we let it train even longer it looks like it would keep improving and get closer to 90 or even 100% accuracy. You can also see how the model uses chain-of-thought reasoning to improve its responses over time and self-reflect.
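As a rough illustration of the kind of reward signal a loop like this can use, here's a small sketch of a rule-based reward: it never grades the reasoning itself, only the final answer and whether the model showed its thinking in the expected format. The regexes, tags, and score values below are my own assumptions for illustration, not the paper's exact reward design.

```python
import re

# Hypothetical rule-based reward for a math question.
def reward(response: str, expected_answer: str) -> float:
    score = 0.0

    # Format reward: did the model wrap its reasoning in <think> ... </think>?
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        score += 0.2

    # Accuracy reward: does the final answer after "Answer:" match?
    match = re.search(r"Answer:\s*(.+)", response)
    if match and match.group(1).strip() == expected_answer:
        score += 1.0

    return score

example = "<think>120 km / 1.5 h = 80 km/h</think>\nAnswer: 80"
print(reward(example, "80"))  # 1.2
```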
In reinforcement learning we can't exactly tell the model how to change its policy, so that's why we use chain-of-thought reasoning to force the model to self-reflect and evaluate its behavior so it gets closer to the maximum reward. That way we can kind of give the model the right incentives through prompts, and the model can re-evaluate how it answers questions with increasing accuracy. This equation is the key behind how DeepSeek uses reinforcement learning to optimize its policy: it uses Group Relative Policy Optimization (GRPO) to essentially score how well it answered a question without being handed the correct answer.
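For reference, here is the GRPO objective roughly as it appears in the DeepSeek papers, with notation lightly simplified; treat this transcription as an approximation of the on-screen equation rather than an exact copy.

```latex
J_{GRPO}(\theta) =
  \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}}
  \Bigg[
    \frac{1}{G} \sum_{i=1}^{G}
      \Big(
        \min\!\Big(
          \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}\, A_i,\;
          \mathrm{clip}\!\Big(
            \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\,
            1-\epsilon,\, 1+\epsilon
          \Big) A_i
        \Big)
        - \beta\, D_{KL}\big(\pi_\theta \,\|\, \pi_{ref}\big)
      \Big)
  \Bigg],
\qquad
A_i = \frac{r_i - \mathrm{mean}(\{r_1,\dots,r_G\})}{\mathrm{std}(\{r_1,\dots,r_G\})}
```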
This looks very complicated, so I'll just briefly explain the most important parts. We take an expectation over answers sampled from the model's old policy, and remember, the policy π is the key thing we're trying to optimize with DeepSeek: we want to change the policy so that DeepSeek outputs better, more correct answers. What we do is take a weighted average that compares how the model answered questions under its old policy with how it answers them under the new policy, and we multiply that by a standardized advantage A_i, which basically says, compared to the average reward, how much does this response increase the reward. We also don't want the model's policy to change too much at once, because that can cause a lot of instability during training. If you look at most reinforcement learning runs, or even the example of the baby, the baby is going to fall down unpredictably so many times; we want our model to be as stable as possible and avoid a roller coaster of policy changes. That's where the clipping comes in: clipping restricts how much the policy ratio can change, keeping it between 1 minus epsilon and 1 plus epsilon. We also subtract a regularization term, the KL divergence, which is another way to stabilize training by making sure the policy doesn't drift too far. In short, all of this is saying that we want to compare our old answers with the new answers and change the policy to maximize the reward, while keeping the policy change itself as small as possible; there's a min-max kind of trade-off here, and that's what the weighted average and the clipping are doing.
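To make those moving parts a bit more tangible, here's a small numerical sketch of a GRPO-style objective for one group of sampled answers. It only shows the group-relative advantage, the clipped ratio, and the KL penalty; the numbers, function names, and use of plain Python are illustrative assumptions, not DeepSeek's actual training code.

```python
import math

# Toy GRPO-style objective for ONE question with a group of G sampled answers.
# rewards[i]          : reward for answer i (e.g. from a rule-based checker)
# logp_new / logp_old : log-probabilities of each answer under the new / old policy

def grpo_objective(rewards, logp_new, logp_old, kl_to_ref, eps=0.2, beta=0.04):
    G = len(rewards)

    # Group-relative advantage: standardize each reward against the group.
    mean_r = sum(rewards) / G
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / G) or 1.0

    total = 0.0
    for r, lp_new, lp_old in zip(rewards, logp_new, logp_old):
        advantage = (r - mean_r) / std_r

        # Probability ratio between the new and old policy for this answer.
        ratio = math.exp(lp_new - lp_old)

        # Clipped surrogate: take the more pessimistic of the raw and clipped
        # terms, so the policy can't change too much in a single update.
        clipped = max(min(ratio, 1 + eps), 1 - eps)
        total += min(ratio * advantage, clipped * advantage)

    # KL penalty keeps the new policy close to a reference policy.
    return total / G - beta * kl_to_ref

rewards = [1.2, 0.0, 1.0, 0.2]          # group of 4 sampled answers (made up)
logp_new = [-3.1, -5.0, -3.4, -4.8]     # hypothetical log-probs, new policy
logp_old = [-3.3, -4.9, -3.5, -4.7]     # hypothetical log-probs, old policy
print(grpo_objective(rewards, logp_new, logp_old, kl_to_ref=0.05))
```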
The third important technique the DeepSeek researchers used with their R1 model is model distillation. The idea here is that the actual DeepSeek model has 671 billion parameters, and to run it you pretty much need at least a couple thousand GPUs and a very expensive computer. So to make it more accessible, what they do is take the larger LLM and use it to teach a smaller LLM how it reasons and how it answers questions, so that the smaller LLM can perform at close to the same level as the bigger LLM at an order of magnitude smaller parameter size, like 7 billion parameters. In the paper, the DeepSeek researchers distilled from their DeepSeek model into Llama 3 as well as Qwen. The idea is that the teacher again uses chain-of-thought reasoning to generate a lot of examples of itself answering questions, and those examples are then given directly to the student as training data, so the student learns to answer questions with similar accuracy to the larger model. This makes the whole LLM ecosystem much more accessible for people who don't have as many resources. And the key insight in the paper is that they found the distilled student model can actually outperform the teacher model by a little bit on some of these benchmarks, while doing so at a small fraction of the memory and storage required to run it.
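Here's a rough sketch of what that distillation pipeline can look like: the teacher writes out chain-of-thought answers, and those become training examples for the student. The function names, file format, and stand-in answer are assumptions for illustration; the actual pipeline fine-tunes the student on a very large set of such samples.

```python
import json

# Hypothetical distillation sketch: a large "teacher" model writes out
# chain-of-thought answers, and we collect them as training examples for a
# much smaller "student" model. `teacher_generate` and `finetune_student`
# are placeholders for whatever model APIs / training code you use.

def teacher_generate(question: str) -> str:
    """Placeholder: ask the big teacher model for a step-by-step answer."""
    raise NotImplementedError

def finetune_student(dataset_path: str) -> None:
    """Placeholder: fine-tune the small student model on the collected data."""
    raise NotImplementedError

questions = [
    "A train travels 120 km in 1.5 hours. What is its average speed in km/h?",
    "Simplify (x^2 - 9) / (x - 3).",
]

# Build a small supervised dataset of (question, teacher reasoning) pairs.
with open("distillation_data.jsonl", "w") as f:
    for q in questions:
        # answer = teacher_generate(q)   # e.g. "<think>...</think> Answer: 80"
        answer = "<think>...</think> Answer: ..."  # stand-in so the script runs
        f.write(json.dumps({"prompt": q, "response": answer}) + "\n")

# finetune_student("distillation_data.jsonl")
```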
In the experiments from the paper, the researchers found that these smaller distilled DeepSeek models outperform larger models like GPT-4o and Claude 3.5 Sonnet on these math, coding, and scientific reasoning tasks, as you can see in the table below. Those are the three key concepts behind how DeepSeek works. Hopefully you enjoyed this video, and if you want, you can go read the paper in the description below, as well as play around with DeepSeek on Ollama yourself.