How Did They Do It? DeepSeek V3 and R1 Explained

No Hype AI
🚀 DeepSeek: The First Open-Weight Reasoning Model! In this video, I’ll break down DeepSeek’s two f...
Video Transcript:
So DeepSeek is the most discussed thing in AI right now, and the main reason is basically this: it shows that the DeepSeek startup spent about 2.8 million GPU hours training their language model.
That would take a single GPU more than 300 years to do alone. But of course, they used about 2,000 GPUs, so it took them about two months to train their model. Now, is this number good?
Is it bad? Well, let's compare it with Llama, for example. The largest Llama 3 model was trained for almost 31 million GPU hours.
That's about 11 times more than DeepSeek! And on top of all of this, they managed to train their model on relatively inferior GPU chips, because under the US export regulations, Nvidia cannot currently sell its most powerful GPUs to China.
So, to bypass this problem, Nvidia nerfed their chips, and instead of H100s, China now gets the less powerful H800s. And these nerfed versions seem cheaper to run, apparently only $2 per GPU-hour, which gets us to this cost of $5.6 million.
That's orders of magnitude less than the training cost of models like GPT and Gemini. And the thing is that until now, only big players like OpenAI, Google, and Meta could train such large models because of the costs, and the rest of us had to wait. That might now change because of DeepSeek.
So we have much less training time, on cheaper GPUs, and the DeepSeek models are also dominating the benchmarks. The Chatbot Arena shows that they are currently on par even with the latest models from OpenAI and Google. So how is this all possible?
Well, there are actually a few caveats to these numbers and things to unpack, like the actual cost of training, which is higher than this...
But before I get to that, let's first cover how the DeepSeek models work and why they are so much faster. There have been lots of things circulating in the media recently, and sometimes I feel like the authors haven't even read any of the papers. So, DeepSeek recently published two models: V3 and R1.
And let's start with V3, which is the third iteration of their main foundation language model; it's basically another big language model like Gemini or Llama. One of the major changes that made V3 so much faster is that it uses something called "Mixture-of-Experts", or MoE. The idea behind MoE is that instead of processing the input with one single large network, we divide the model into smaller networks, called experts, where each expert specializes in something unique, like a different domain or a different type of syntax.
And when we process the input, we don't pass it to all the experts. Instead, we use an additional smaller network, called the router, which looks at the given input and maybe says something like: "Hmm, I think only the first and the last experts should process this input." And for the next input, it picks a different set of experts, and so on.
MoE is applied on every layer of the transformer network, where it replaces the dense feed-forward layers. And if this works, it's great: first, we're processing the input with specialized sub-networks that can do the task better; second, and even more important here, we're saving a lot of computation, because all the other experts are skipped entirely. MoE as a concept is nothing new; it dates back to the early '90s, and it was recently used in an LLM called Mixtral, for example.
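To make the routing idea concrete, here is a minimal top-k MoE layer in PyTorch. This is a generic sketch, not DeepSeek's actual code: the sizes, the softmax gating, and k=2 are all assumptions for the example.

```python
# Minimal sketch of top-k MoE routing (generic illustration, not DeepSeek's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # the small "router" network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)   # router's preference per expert
        weights, idx = probs.topk(self.k, dim=-1)   # keep only the top-k experts
        out = torch.zeros_like(x)
        for t, (w, ids) in enumerate(zip(weights, idx)):
            # only the selected experts run for this token; the rest are skipped
            for gate, e in zip(w, ids):
                out[t] += gate * self.experts[int(e)](x[t])
        return out

moe = TopKMoE()
y = moe(torch.randn(4, 64))  # 4 tokens, each processed by just 2 of the 8 experts
```

In a real implementation the per-token loop is replaced by batched scatter/gather kernels, but the saving is the same: only k of the n experts do any work per token.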
But the problem is that it can be hard to get MoE working well, because many things can go wrong here. For example, how do you actually separate the knowledge and make sure that each expert is specialized? There is also a well-known failure mode called "routing collapse", where the router learns to use only one or very few experts every single time and ignores the rest.
So you need to somehow encourage the router to select different experts and spread the knowledge around. But DeepSeek implemented several improvements here! First, they increased the number of experts... I mean, sure, why not?
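One well-known mitigation for routing collapse, used for example in the Switch Transformer line of work, is an auxiliary load-balancing loss. To be clear, this is an illustration of the general trick, not DeepSeek's method: the V3 paper actually reports an auxiliary-loss-free strategy that nudges per-expert bias terms instead.

```python
# Auxiliary load-balancing loss (Switch Transformer style): gets small only
# when routing traffic is spread evenly across experts. Illustrative sketch.
import torch

def load_balancing_loss(router_probs, expert_idx, n_experts):
    # router_probs: (n_tokens, n_experts) softmax output of the router
    # expert_idx:   (n_tokens, k) experts actually chosen per token
    one_hot = torch.zeros_like(router_probs).scatter(1, expert_idx, 1.0)
    dispatch_frac = one_hot.sum(dim=0) / expert_idx.numel()  # share of traffic per expert
    prob_frac = router_probs.mean(dim=0)                     # average router probability
    # both distributions must be near-uniform for this product sum to be minimal
    return n_experts * torch.sum(dispatch_frac * prob_frac)
```

Adding a small multiple of this loss to the training objective punishes the router whenever it keeps sending everything to the same few experts.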
But then they also added additional shared experts, which are always selected. The idea is that these shared experts will learn the general knowledge that applies across different contexts, leaving the other, non-shared experts free to learn more specialized knowledge. These ideas are nothing novel and have been implemented in NLP many times before, but it's quite impressive that they got them working at such a large scale.
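Here's a tiny sketch of how a shared expert changes the forward pass, building on the routing example above (again, the shapes and the single shared expert are illustrative assumptions, not DeepSeek's configuration):

```python
# Shared + routed experts: the shared expert always runs, the routed ones
# are still picked per token. Illustrative sketch only.
import torch
import torch.nn.functional as F

def moe_with_shared(x, shared_expert, routed_experts, router, k=2):
    probs = F.softmax(router(x), dim=-1)
    w, idx = probs.topk(k, dim=-1)
    routed = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for gate, e in zip(w[t], idx[t]):
            routed[t] += gate * routed_experts[int(e)](x[t])
    return shared_expert(x) + routed  # the always-on path carries general knowledge
```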
So now you will also know what the term "activated parameters" means. Maybe you've noticed that the DeepSeek website mentions their model having a total of 671 billion parameters, but only 37 billion of them activated. That simply means that for each input token, the network selects only a specific number of experts, and the total number of parameters across the selected experts is always 37 billion.
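Here's a back-of-envelope calculation showing how a "671B total / 37B activated" split can arise. The config numbers below are hypothetical, chosen only to land in the same ballpark:

```python
# Hypothetical MoE budget (NOT V3's real configuration), in billions of params.
n_moe_layers = 58      # layers that use MoE
n_experts    = 256     # routed experts per layer
k, shared    = 8, 1    # experts activated per token, plus one always-on expert
per_expert_b = 0.044   # parameters per expert
dense_b      = 11.0    # attention, embeddings, and other non-expert parameters

total     = dense_b + n_moe_layers * n_experts * per_expert_b
activated = dense_b + n_moe_layers * (k + shared) * per_expert_b
print(f"total ~{total:.0f}B, activated ~{activated:.0f}B")  # ~664B vs ~34B
```

The point is the ratio: every token pays for the dense parts plus a handful of experts, so per-token compute is a small fraction of what a dense 671B model would need.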
However, that doesn't mean you can just load 37 billion parameters onto your GPU. You still need to load the entire model if you want to use it, because as you generate tokens, you'll be using a different set of experts each time and will eventually need all of them. For instance, the Mixtral paper has a very nice example of MoE on code generation.
There, they colored each expert's output in a different color, and you can see that it's not all just one color, right? The model is using all sorts of experts all the time, and here the routing actually seems more aligned with syntax than with domain. So when you see charts like this on NBC News, for example, I think they can be quite misleading.
It shows that DeepSeek uses fewer parameters than Llama and Qwen during "interactions", which is a very ambiguous word: when you interact with a model, you will need all of its parameters. And some charts in papers can be a bit confusing too.
Like this one from the DeepSeek paper on Mixture-of-Experts. They're showing that their model uses less than 3 billion parameters and is as good as Llama 2 with its 7 billion. But to actually run it, you will still end up loading all 16 billion of its parameters.
Just not all at the same time. When it comes to speed, though, it's a different story, because MoE lets us do far less computation. And lowering the computational requirements and moving less data between their GPUs was actually essential to DeepSeek's success.
Because, as I've already mentioned, China doesn't have access to the most powerful GPUs right now. The last thing I want to mention about V3 is that they trained the model mostly in FP8 precision. That means each parameter was represented by 8 bits.
People usually train in FP16 or BF16, which is much more stable, but it requires double the memory: 16 bits. Well, not something DeepSeek wanted, right? So, yeah... this on its own can basically double your training speed.
But it's not so simple. They had to use a mixed-precision framework, where the most compute-intensive operations are done in FP8, while other, more precision-sensitive computations are kept in higher precision. And it seems like it worked well!
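To illustrate the core FP8 trick, here is a conceptual sketch of a scaled FP8 matmul in PyTorch (which ships float8 dtypes from roughly version 2.1). This only shows the idea of per-tensor scaling plus higher-precision accumulation; it is not DeepSeek's training framework, which uses fine-grained scaling on real FP8 hardware kernels:

```python
# Conceptual FP8 matmul with per-tensor scaling. Illustrative sketch only.
import torch

def fp8_matmul(x, w):
    # FP8 e4m3 tops out around 448, so rescale before casting to avoid overflow
    sx = x.abs().max() / 448.0
    sw = w.abs().max() / 448.0
    x8 = (x / sx).to(torch.float8_e4m3fn)  # stored and moved at 8 bits per value
    w8 = (w / sw).to(torch.float8_e4m3fn)
    # do the precision-sensitive reduction in float32, then undo the scaling
    return (x8.to(torch.float32) @ w8.to(torch.float32)) * (sx * sw)

x, w = torch.randn(4, 64), torch.randn(64, 64)
print((fp8_matmul(x, w) - x @ w).abs().max())  # small quantization error
```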
If this can be consistently replicated, everyone will most likely start training in FP8 now. The second model released by DeepSeek is called R1. And R1 is a reasoning model, similar to OpenAI's o1 or o3.
The difference between non-reasoning and reasoning models is that when we use a non-reasoning model like Llama, we basically ask it to spit out the answer right away, whereas a reasoning model will first generate lots of reasoning text, which gives it space and time to self-reflect and explore different reasoning paths. The final answer is only generated at the end.
But OpenAI hides this internal reasoning output from us. So it's harder to understand how it works, verify the reasoning, and also replicate OpenAI's work. And DeepSeek is the first company that figured it out!
They published a model that matches the performance of OpenAI's reasoning models, and they told us everything about it! So how does it work? First, R1 builds on top of V3.
They already have this huge 671-billion-parameter language model, and they've already spent those roughly $6 million on it. So now DeepSeek just wants to teach it how to reason better.
One way of learning this complex reasoning behaviour would be to train the model on examples of such reasoning. For example, if you want high-quality data, you could have human annotators solve math problems, explain their reasoning step by step, and write it all down, and then you take that and use it for supervised learning. But that would be quite expensive!
So instead, they thought: what if we only use reinforcement learning? Basically, they let the model generate lots of solutions to a given problem, then used a rule-based reward system to identify correct answers and correct reasoning steps, and then reinforced those outputs so that the model becomes more likely to generate them again.
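Here's a heavily simplified sketch of what such a rule-based reward can look like. The exact rules and weights are my assumptions, not the paper's; the R1 report describes accuracy rewards plus a format reward for wrapping reasoning in <think> tags:

```python
# Toy rule-based reward: format reward for <think>...</think> reasoning,
# accuracy reward for a verifiably correct final answer. Weights are made up.
import re

def reward(output: str, gold_answer: str) -> float:
    r = 0.0
    if re.search(r"<think>.+?</think>", output, flags=re.S):
        r += 0.5                                   # format: reasoning is present
    final = output.split("</think>")[-1].strip()   # text after the reasoning
    if gold_answer in final:
        r += 1.0                                   # accuracy: answer matches
    return r

out = "<think>2 + 2: add the numbers... that's 4.</think> The answer is 4."
print(reward(out, "4"))  # 1.5 -> correct format and correct answer
```

Because the reward is computed by rules rather than by humans or a learned reward model, it's cheap to score huge numbers of sampled solutions.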
And as they trained the model, they noticed one interesting behaviour: the average length of the responses was slowly increasing and increasing. Basically, the model was learning to reason for longer and longer. Of course, that alone could just mean it's producing longer and longer nonsense.
But no! These are the final results, and they are very close to OpenAI's reasoning models. Their conclusion, though, was that this pure-RL model (called R1-Zero in the paper) still struggles with issues like poor readability and language mixing.
So they actually decided to collect a small amount of high-quality data and use it to fine-tune the V3 model first, before the reinforcement learning. And that's the final R1 model, which scores very high on all the benchmarks. Now, you might think that surely that's all we got from DeepSeek.
Nope, they actually did one more thing that's really good for us! They took R1, asked it a few thousand questions, and then trained models like Llama and Qwen on those outputs. So they basically distilled the reasoning from the large model into these smaller models.
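Here's a sketch of what that distillation recipe looks like in practice: sample reasoning traces from R1 and save them as supervised fine-tuning data for a smaller model. The prompts, the file name, and the use of the ollama Python client are my own illustrative choices, not DeepSeek's pipeline:

```python
# Build a small distillation dataset from a local R1 model via Ollama.
# Illustrative sketch; prompts and file name are placeholders.
import json
import ollama  # pip install ollama (and have the Ollama server running)

questions = ["What is the derivative of x^2?", "Is 97 prime?"]  # placeholders

with open("distill_data.jsonl", "w") as f:
    for q in questions:
        resp = ollama.chat(model="deepseek-r1:8b",
                           messages=[{"role": "user", "content": q}])
        # the teacher's full output (reasoning included) becomes the target
        f.write(json.dumps({"prompt": q,
                            "completion": resp["message"]["content"]}) + "\n")

# A smaller model (e.g. a Llama or Qwen checkpoint) is then fine-tuned on this file.
```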
And these distilled models are what you actually want to use if you plan to try DeepSeek locally, because the original models are just way too big. Using these distilled models is super easy: you can just use Ollama, choose any of these options, and... I will do "ollama run deepseek-r1:8b" and then just wait... and wait... and eventually, we get their model running!
Now, a few thoughts on the training cost and on DeepSeek as a company. $5.6 million is just the cost of their final training run, a single run!
To get there, they had to do lots of research, and they had to run many training jobs in parallel to test everything this fast. We can see lots of interesting innovations in their papers.
But when you have so many moving parts, you have to ablate all of them to be sure those decisions are justified before they go into the final product. That all had to be very expensive. And the data processing of almost 15 trillion tokens couldn't have been cheap either.
But fortunately, they released the model and many of their findings, so the AI community can now benefit from all of this! Regarding DeepSeek, I hear people calling them a random startup and saying this model is just a side project. That's definitely not true.
They've been publishing lots of research papers over the past year or two, such as all of these...
And lots of excellent people from DeepSeek and from academia were involved in these projects, so there has been a lot of intentional work behind this. They also did many advanced optimizations in PTX, which is an intermediate instruction set that sits between CUDA and the GPU's native machine code.
And this is very hard to do! It reflects the skills of those people, and I would say a lot of money and time was invested in this. Overall, this is definitely a win for open research.
Before DeepSeek, only the largest companies like OpenAI, Google, and Meta could train such models; now the field is leveled a bit, and more new players can join. For example, Hugging Face has already started reproducing the R1 model, and I'm very excited to see where this will go! And maybe even universities will be able to join the race and come up with their own models now.
Like, I'm doing my PhD at Cambridge now, and there is no way Cambridge would have invested billions into training models like o1. Well, $5 million is still a lot, but that's a different kind of discussion. The point is that it's getting more affordable, which is great news for open research.
So that's all for today. Please consider subscribing if you liked the video, and thank you for watching!