So DeepSeek is the most discussed thing in AI right now, and the main reason is basically this. It shows that the DeepSeek startup spent about 2.8 million GPU hours on training their language model.
A single GPU would need more than 300 years to do that alone. But of course, they used about 2,000 GPUs in parallel, so training took them roughly two months (2.8 million hours ÷ 2,000 GPUs ≈ 1,400 hours, or about 58 days). Now, is this number good?
Is it bad? Well, let's compare it with Llama, for example. The largest Llama 3 model was trained for almost 31 million GPU hours.
That's about 11 times more than DeepSeek! Now, on top of all of this, they managed to train their model on relatively inferior GPU chips. Because of the US export regulations, Nvidia cannot currently sell their most powerful GPUs to China.
So, to get around this, Nvidia nerfed their chips, and instead of H100s, China now gets the less powerful H800s. These nerfed versions are also cheaper to rent, apparently only about $2 per GPU hour, which gets us to the famous cost of $5.6 million (2.8 million GPU hours × $2/hour ≈ $5.6 million).
That's orders of magnitude less than the training cost for models like GPT and Gemini. And the thing is that, until now, only the big players like OpenAI, Google, and Meta could afford to train such large models, and the rest of us had to wait. That might change because of DeepSeek now.
So we have much less training time, on cheaper GPUs, and the DeepSeek models are also dominating the benchmarks. The Chatbot Arena shows that they are currently on par even with the latest models from OpenAI and Google. So how is this all possible?
Well, there are actually a few caveats to these numbers and things to unpack, like the actual cost of training, which is higher than this...
But before I get to that, let's first cover how the DeepSeek models work and why they are so much faster. Because there have been lots of things circulating in the media recently, and sometimes I feel like the authors haven't even read any of the papers. So, DeepSeek recently published two models - V3 and R1.
And let's start with V3, which is the third iteration of their main foundation language model, and it's basically another big language model like Gemini or Llama. One of the major changes that made V3 so much faster is that it uses something called "Mixture-of-Experts", or MoE. And the idea behind MoE is that instead of processing input by a single large model, we divide the model into smaller networks, called experts, where each expert specializes in something unique - like a different domain or a different type of syntax.
And when we process an input, we don't pass it to all the experts. Instead, we use an additional smaller network, called a router, which looks at the given input and maybe says something like: "Hmm, I think only the first and the last experts should process this input". And for the next input, it picks a different set of experts, and so on.
MoE is applied at every layer of the transformer, where it replaces the dense feed-forward blocks. And if this works, it's great for two reasons. First, we're processing the input with specialized sub-networks that can do the task better. Second, and that's even more important here, we're saving a lot of computation, because all the experts that weren't selected are completely skipped. MoE as a concept is nothing new. It dates back to the early '90s, and it was recently used in an LLM called Mixtral.
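To make this concrete, here's a minimal sketch of a top-k routed MoE layer in PyTorch. This is just a toy illustration of the general idea, not DeepSeek's implementation; the dimensions, the number of experts, and the top-k value are all made up.

```python
# Toy top-k routed MoE layer - a sketch of the general idea, not DeepSeek's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim=512, num_experts=8, top_k=2):
        super().__init__()
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                                  # x: (num_tokens, dim)
        scores = F.softmax(self.router(x), dim=-1)         # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                      # tokens routed to expert e in slot k
                if mask.any():
                    # Only the selected experts run; all the others are skipped entirely.
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(16, 512)
print(moe(tokens).shape)   # torch.Size([16, 512])
```

Real implementations batch and dispatch this routing far more cleverly, but the principle is the same: the router decides, and only the chosen experts do any work.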
But the problem is that it can be hard to get this working well, because many things can go wrong here. For example, how do you actually separate the knowledge and make sure that each expert is specialized? Also, there is a well-known problem called "routing collapse", where the router learns to use only one or a few experts every single time and ignores the rest.
So, you need to somehow encourage the router to select different experts and spread the knowledge around. And DeepSeek implemented several improvements here! First, they increased the number of experts... I mean, sure, why not? But then they also added shared experts, which are always selected. The idea behind this is that the shared experts will learn the general knowledge that is useful across different contexts, which leaves the other, routed experts free to learn more specialized knowledge. These ideas are nothing novel and have been implemented in NLP many times before, but it's quite impressive that they got them working on such a large scale.
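Continuing that toy example (again, purely illustrative; the counts are made up and nothing here is DeepSeek's actual code), the shared experts simply run on every token, while the routed ones still go through the top-k gate:

```python
# Toy MoE layer with always-on shared experts plus top-k routed experts.
# TopKMoE is the routed layer from the previous sketch; sizes are made up.
import torch
import torch.nn as nn

def ffn(dim):
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

class SharedPlusRoutedMoE(nn.Module):
    def __init__(self, dim=512, num_shared=1, num_routed=8, top_k=2):
        super().__init__()
        self.shared_experts = nn.ModuleList([ffn(dim) for _ in range(num_shared)])
        self.routed = TopKMoE(dim, num_routed, top_k)   # routed part from the sketch above

    def forward(self, x):
        # Shared experts process every single token, so they can absorb general knowledge...
        out = sum(expert(x) for expert in self.shared_experts)
        # ...while the routed experts are free to specialize on whatever the router sends them.
        return out + self.routed(x)
```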
So now you will also know what the term "activated parameters" means. Maybe you've noticed that the DeepSeek website mentions their model having a total of 671 billion parameters, but only 37 billion of them are activated. That simply means that for each input token, the network selects only a small subset of experts, and the parameters actually used for that token add up to 37 billion.
However, that doesn't mean you can load just 37 billion parameters onto your GPU. You still need to load the entire model if you want to use it, because as you generate tokens, you'll be hitting different sets of experts each time and will eventually need all of them.
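Here's the rough back-of-envelope math; the exact footprint depends on precision, the KV cache, and runtime overhead, so treat these as ballpark numbers.

```python
# Back-of-envelope memory math: why "37B activated" doesn't mean "37 GB is enough".
total_params    = 671e9   # all experts combined
active_params   = 37e9    # parameters used per token
bytes_per_param = 1       # FP8 weights; it would be 2 bytes for FP16/BF16

print(f"Weights you must keep loaded: {total_params * bytes_per_param / 1e9:.0f} GB")   # ~671 GB
print(f"Weights touched per token:    {active_params * bytes_per_param / 1e9:.0f} GB")  # ~37 GB
```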
For instance, the Mixtral paper has a very nice example of MoE on code generation. They colored each expert's output in a different color, and you can see that it's not all just one color, right? The model is using all sorts of experts all the time, and here the routing seems to align more with syntax than with domain, actually. So, when you see charts like this on NBC News, for example, I think they can be quite misleading.
It shows that DeepSeek uses fewer parameters than Llama and Qwen during "interactions", which is a very ambiguous word. When you interact with a model, you will need all of them. And some charts in papers can be a bit confusing too.
Like this one from the DeepSeek paper on mixture of experts. They're showing that their model uses less than three billion parameters and is as good as Llama 2 with 7 billion parameters. But when it comes to memory, you will still end up using all 16 billion of them.
Just not at the same time. However, when it comes to speed, that's a different story because MoE allows us to do far less computation. And lowering their computational requirements and moving less data between their GPUs was actually essential to their success.
Because, as I've already mentioned, China doesn't have access to the most powerful GPUs right now. The last thing I want to mention about V3 is that they trained the model mostly in FP8 precision. That means each parameter is represented with just 8 bits.
People usually use FP16 or BF16 for training, which is much more stable, but it requires double the memory - 16 bits per parameter. Well, not something DeepSeek wanted to spend, right? So, yeah... this on its own can basically double your training speed.
But it's not so simple. They had to use a mixed-precision framework, where the most compute-intensive operations (like the big matrix multiplications) are done in FP8, while other, more sensitive computations are kept in higher precision. And it seems like it worked well!
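Here's a very rough sketch of that idea in PyTorch, just to show the memory trade-off. Real FP8 training, including DeepSeek's, runs the matmuls on FP8 tensor cores with careful scaling; this toy only simulates the storage side and upcasts for the actual compute.

```python
# Toy illustration of FP8 storage with higher-precision compute.
# Real FP8 training uses FP8 tensor cores and per-tile scaling; this only
# simulates the precision/memory trade-off.
import torch

master_weight = torch.randn(4096, 4096, dtype=torch.float32)   # high-precision "master" copy

# Store/communicate the weight in FP8: 1 byte per value instead of 2 (BF16) or 4 (FP32).
w_fp8 = master_weight.to(torch.float8_e4m3fn)
print(w_fp8.element_size(), "byte per parameter")               # 1

# For the actual matmul, this sketch upcasts back to BF16 (FP8 matmuls need special kernels).
x = torch.randn(8, 4096, dtype=torch.bfloat16)
y = x @ w_fp8.to(torch.bfloat16)
print(y.dtype, y.shape)
```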
If we can consistently replicate this, everyone will most likely start doing training in FP8 now. The second model released by DeepSeek is called R1. And, R1 is a reasoning model, similar to OpenAI's o1 or o3.
The difference between non-reasoning and reasoning models is that when we use a non-reasoning model like Llama, we basically ask it to spit out the answer right away. A reasoning model, on the other hand, will first generate lots of reasoning text, which gives it space and time to self-reflect and explore different reasoning paths, and only then generate the final answer at the end.
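With the openly released R1 models, you can actually see that reasoning text: the model wraps it in <think>...</think> tags before giving the final answer. A tiny sketch of separating the two (assuming that tag convention):

```python
# Split a DeepSeek-R1-style completion into reasoning and final answer.
# Assumes the open R1 models' convention of wrapping reasoning in <think> tags.
import re

completion = "<think>The user asks for 17 * 23. 17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391.</think>17 * 23 = 391."

match = re.search(r"<think>(.*?)</think>(.*)", completion, flags=re.DOTALL)
reasoning, answer = match.group(1).strip(), match.group(2).strip()
print("Reasoning:", reasoning)
print("Answer:   ", answer)
```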
But, OpenAI hides this internal reasoning output from us. So, it's harder to understand how it works, verify the reasoning, and also to replicate OpenAI's work. And DeepSeek is the first company that figured it out!
They published a model that matches the performance of OpenAI's reasoning models, and they told us everything about it! So how does it work? First, R1 builds on top of V3.
They already have this huge, roughly 700-billion-parameter language model, and they've already spent those roughly $6 million on it. So now DeepSeek just wants to teach it how to reason better.
One way of learning this complex reasoning behaviour would be to train the model on examples of such reasoning. For example, if you want high-quality data, you could have human annotators solve math problems, explain their reasoning step by step, write it all down, and then use that for supervised learning. But that would be quite expensive!
So, instead, they were like: what if we only use reinforcement learning? Basically, they let the model generate lots of solutions for a given problem, then used a rule-based reward system to check which outputs arrive at the correct answer (and follow the expected format), and then they reinforced those outputs so that the model becomes more likely to generate them again.
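Just to illustrate the flavour of such a rule-based reward: the sketch below combines an accuracy check with a format check, but the specific tags, rules, and values are my own assumptions, not DeepSeek's exact setup.

```python
# Toy rule-based reward in the spirit of an accuracy reward plus a format reward.
# Tag convention and weights are assumptions for illustration only.
import re

def reward(completion: str, ground_truth: str) -> float:
    r = 0.0
    # Format reward: did the model wrap its reasoning in <think> tags?
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        r += 0.1
    # Accuracy reward: does the text after the reasoning contain the correct answer?
    answer_part = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL)
    if ground_truth in answer_part:
        r += 1.0
    return r

print(reward("<think>340 + 51 = 391</think>The answer is 391.", "391"))  # 1.1
print(reward("The answer is 400.", "391"))                               # 0.0
```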
And as they trained the model, they noticed one interesting behaviour: the average response length was slowly increasing and increasing. Basically, the model was learning to reason for longer and longer. Of course, that could just mean it was producing longer and longer nonsense.
But no! These are the final results, and they are very close to OpenAI's reasoning models. Their conclusion, though, is that this pure-RL model (they call it R1-Zero) still struggles with issues like poor readability and language mixing.
So, they decided to collect a small amount of high-quality data and use it to fine-tune the V3 model first, before the reinforcement learning stage. And that, roughly, gives the final R1 model, which scores very high across the benchmarks. Now, you might think, surely that's all we got from DeepSeek.
Nope, they actually did one more thing that's really good for us! They took R1, used it to generate a large set of question-and-answer examples, and then fine-tuned models like Llama and Qwen on those outputs. So, they basically distilled the reasoning from the large model into these smaller models.
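In spirit, the data-collection half of that recipe looks something like the sketch below. The endpoint, model name, prompts, and file path are placeholders I'm assuming for illustration; any OpenAI-compatible client pointed at an R1 deployment would do.

```python
# Sketch of the distillation recipe: collect teacher (R1) outputs, then fine-tune a
# smaller model on them with plain supervised learning. Endpoint, model name and
# paths are placeholders, not DeepSeek's actual setup.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")  # any OpenAI-compatible endpoint

prompts = ["Prove that the sum of two even numbers is even.",
           "How many primes are there below 30?"]

with open("distill_data.jsonl", "w") as f:
    for p in prompts:
        resp = client.chat.completions.create(
            model="deepseek-reasoner",              # the teacher model
            messages=[{"role": "user", "content": p}],
        )
        # Save (prompt, teacher output) pairs; a smaller Llama/Qwen model is then
        # fine-tuned on these with the usual next-token cross-entropy loss.
        f.write(json.dumps({"prompt": p, "response": resp.choices[0].message.content}) + "\n")
```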
And these distilled models are what you actually want to use if you plan to try DeepSeek locally, because the original model is just way too big. But using the distilled models is super easy. You can just use Ollama, choose any of these options, and... I will do "ollama run deepseek-r1:8b" and then just wait... and wait... and eventually, we get the model running!
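Once the model is pulled, you can also talk to it programmatically through Ollama's local HTTP API; this assumes the default Ollama setup listening on localhost:11434.

```python
# Query the locally running distilled model through Ollama's default HTTP API.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-r1:8b",
        "messages": [{"role": "user", "content": "How many primes are there below 30?"}],
        "stream": False,
    },
)
print(resp.json()["message"]["content"])   # reasoning inside <think> tags, then the answer
```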
Now, a few thoughts on the training cost and on DeepSeek as a company. $5.6 million is just the cost of their final training run - a single run!
To get there, they had to do lots of research, and they had to run many training experiments in parallel to test everything this fast. Their papers are full of interesting innovations.
But when you have so many moving parts, you have to ablate all of them to be sure those decisions are justified before they go into the final product. That all had to be very expensive. Also, collecting and processing almost 15 trillion tokens of training data couldn't have been cheap either.
But fortunately, they released the model and many of their findings, so the AI community can now benefit from all of this! Regarding DeepSeek, I hear people calling them a random startup and saying that this model is just a side project. That's definitely not true.
They've been publishing lots of research papers over the past year or two, such as all of these...
And lots of excellent people from DeepSeek and from academia were involved in these projects. So, there has been a lot of intentional work behind this. They also did many advanced optimizations in PTX, which is Nvidia's intermediate instruction set that sits between CUDA and the GPU's native machine code.
And this is very hard to do! It reflects the skill of the people involved, and I would say a lot of money and time went into this. Overall, this is definitely a win for open research.
Before DeepSeek, only the largest companies like OpenAI, Google, and Meta could afford to train models at this scale, and now the field is levelled a bit and more new players can join. For example, Hugging Face has already started reproducing the R1 model, and I'm very excited to see where that will go! And maybe even universities will be able to join the race and come up with their own models now.
Like, I'm doing my PhD at Cambridge now, and there is no way Cambridge would have invested billions into training models like o1. Well, $5 million is still a lot, but that's a different kind of discussion. The point is that it's getting more affordable, and that is great news for open research.
So that's all for today - please consider subscribing if you liked the video and thank you for watching!