DeepSeek is a Game Changer for AI - Computerphile

Computerphile
An AI model that changed the fortunes of Silicon Valley overnight. DeepSeek has been released open ...
Video Transcript:
Every day another piece of AI is announced, so why is this one so important? We don't tend to do many videos for the release of a new AI model, just because there are a lot of them and most of them aren't that interesting. But in the last few days a model called DeepSeek has come out, along with a new model called DeepSeek-R1, and they are very interesting. I think they are really threatening the kind of monopoly that certain companies have on this technology, so let's talk about why that is and why we should be excited.

Perhaps we should step back for a minute, for those people who haven't been paying attention, or for whom this is the first time watching a video with me in it: what is a large language model? A large language model is a very, very big Transformer-based neural network that does next-word prediction. There's a lot of jargon there, so let's get through some of it. A neural network is the sort of standard for machine learning: things like convolutional neural networks are very popular in image-based computer vision, AI for video, these kinds of things, and Transformers, since about 2017, have become a really big thing in generative AI. What we didn't really know was how far you can push them, or how good they can get. You could, I suppose, split generative AI in two: you've got diffusion models, which do image generation, and you've got Transformers, which do text generation, and it's worth remembering that all modern AI of this kind is text generation, rather than going off, coming up with some fundamental concepts of its own, and coming back. The way you normally train these models is you get a huge model that is too big for most of us to train, you get hundreds of thousands of GPUs (the graphics cards that train the model), and then you just batch through all of the text on the internet, learning how to predict the next word, over and over again, until you get so good at it that you can start regurgitating facts, even solving logic problems or mathematical problems and things like this.
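To make "next-word prediction" concrete, here is a minimal sketch in PyTorch. It is illustrative only, not DeepSeek's or anyone else's actual training code: a tiny Transformer is repeatedly shown sequences of tokens and penalised, via cross-entropy, whenever its guess for the next token is wrong. The sizes, the random data, and the step counts are placeholders.

```python
# A minimal sketch of next-word-prediction training (illustrative, not DeepSeek's code).
# Real models are vastly larger and train on trillions of real tokens; here the data
# is random and the model is tiny.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ = 1000, 128, 32          # toy vocabulary, embedding size, context length

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)            # score for every word in the vocabulary

    def forward(self, tokens):
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)   # causal mask: no peeking ahead
        return self.head(h)                              # logits for the next token at each position

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):                                  # in reality: millions of steps over web-scale text
    batch = torch.randint(0, VOCAB, (8, SEQ + 1))        # stand-in for real tokenised text
    inputs, targets = batch[:, :-1], batch[:, 1:]        # predict token t+1 from tokens up to t
    loss = F.cross_entropy(model(inputs).reshape(-1, VOCAB), targets.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

The whole recipe really is that loop, just scaled up enormously in model size, data, and hardware.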
But what's happened over the last few years, since ChatGPT was announced in 2022 (it hasn't even been that long), is a kind of arms race among the tech companies: who can build the biggest model, who can build the most performant model. Generally speaking, their approach has been to make the data sets bigger, make the models bigger, make them more clever, and train them until you get better performance. We've spoken in previous videos about how long you can keep doing that, but that's not what we're talking about today. The idea has basically always been that if you have 100,000 graphics cards and billions of dollars to spend, you're going to be at an advantage, because you have the power to train the largest models.

Now, some companies like OpenAI keep their models behind the scenes and expose them through an API or a web interface, and they very rarely give them away to anyone apart from perhaps close partners. They also very rarely tell you exactly how they've trained the model, what the data set was, what the actual model parameters are, how big it is, and things like this. On the other hand, a company like Meta (you'll know Meta from Facebook) has a much more open AI policy: they release their models, called Llama, which are very good, released for free, and anyone can use them; I've used them. But regardless of whether you release the models, they're out of reach of most people. I can run these models, I can refine them, but I can't train them from scratch; I don't have that kind of resource, and I don't have enough money to pay the electricity bill for the server farm. In most areas of science you might read a paper and go, "That's good, but I think I can change this one thing and make it slightly better," and then you release a paper that makes it slightly better, then someone else does, and someone else does, and everyone gets slightly better at doing this thing. That doesn't happen so much with these big AI models, because ultimately there are only a few people who can do it. I'm an academic, so of course I would think this, but personally I think openness is a good thing.
Now, what's happened in the last few days, and also just before the end of last year, is that a small company in China has released a model called DeepSeek. There are a couple of flavours of this model and we can talk about them, but they've changed the game a little bit, because what they've basically shown is that you can train with more limited hardware (still expensive, but much more limited), and you can train, particularly with the most recent variant, much more efficiently in terms of the amount of data you need to collect, which has been a huge pain for everyone. So let's break down some of the things this model does that other models don't, and why this is going to be quite a big deal.

Let's talk about V3 first. V3 is their flagship model, which you can think of as a little bit like ChatGPT: it's basically another large Transformer, trained on lots and lots of text, and it's conversational. I've spoken to it, as much as one can speak to a chatbot, and it answers perfectly reasonable questions and acts much the way you would expect, and you can actually use it online. Now, V3 offers lots of performance benefits over previous models which make it much, much cheaper to train. The company behind it claims they trained this model, which is similar in performance to Llama and ChatGPT, for about $5 million of hardware and electricity. To give you an idea, one of the largest models might cost upwards of a hundred million, possibly towards a billion dollars, and you've also got to consider that you may not hit the right model the first time; you might train multiple variants. The amount of money and electricity being used to train these models is absolutely enormous; there's a reason Microsoft are exploring restarting a nuclear power plant. V3 achieves this in lots of different ways, but I want to talk about a couple that I think are quite interesting. One of them is called mixture of experts.
One of the things that big models have tried to do is everything: the idea is you have one chatbot to rule them all, and whether you ask it a maths problem, a language problem, or the definition of something in physics, it will be able to answer. We've observed that that is often the case, but not always, and the better you get at some tasks, maybe you get slightly worse at others; it's a jack-of-all-trades, master-of-none kind of situation. But there is another downside, which is that it's very, very expensive. If you have a model with, let's say, 4-500 billion parameters, you have to store that in memory somewhere, and when you run through it, all of it gets activated: you have to work the mathematics through the whole model to get to the prediction. So there's the training, which takes a lot of energy anyway, and then there's inference, which is what we call it when you actually talk to the thing. And you've got to remember that you go through this network a lot of times, because it only produces one or two tokens at a time, so it's not very efficient. Lots of AI models, for classification say, run once and tell you the answer; this one tells you the next word, then goes back and tells you the next word again. It's hugely expensive: once it's been trained, which maybe cost $100 million, it's then difficult to use, because it's a giant model that has to sit in huge data centres so that enough people can talk to it at once, and it's very expensive even to ask it a simple question. That's how things have been up to now.

What mixture of experts does is try to have different bits of the network focus on different tasks. Imagine you have a prompt that comes in, and a giant network that finally gives you your answer over here, and this network has, let's say, 670 billion parameters. The problem is you don't know exactly which bit is solving which problem: it might be a little bit over here and a little bit over there, and actually the rest of it is not useful in solving this particular question, because it's doing other stuff. So maybe you ask a very specific maths question. What mixture of experts does is train a specific, much smaller part of this network to solve that problem for you: the early stages route the question to different parts of the network and only activate a small part of it, let's say 30 billion parameters, which is a huge saving. This sort of shaded area will activate, and that will produce your answer. You can develop systems using agents like this, where you have one network trained to do this and another trained to do that, and you just ask the right one. Suppose I want a network to write my emails for me; maybe it's very good at that, and then I train a different network to solve a different problem, and I just ask the right one, as opposed to hoping that one model can do it all. That's much more efficient, because you can distribute these different bits across a data centre and some of them can lie dormant when they're not being activated. It's much cheaper to do, and I think that's one of the reasons why DeepSeek's pricing is already coming in very, very low compared to some of the other companies.
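To show the shape of the idea, here is a minimal mixture-of-experts layer sketched in PyTorch. It is illustrative only, not DeepSeek's architecture: a small gating network scores the experts for each token, and only the top-scoring ones are actually run, so most of the parameters stay dormant for any given input.

```python
# Illustrative mixture-of-experts layer (a sketch, not DeepSeek's implementation).
# A gate picks the top-k experts per token; the other experts are never run.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)     # decides which experts see each token
        self.top_k = top_k

    def forward(self, x):                           # x: (num_tokens, dim)
        scores = F.softmax(self.gate(x), dim=-1)    # routing probabilities
        weights, chosen = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):              # run only the selected experts
            for e, expert in enumerate(self.experts):
                picked = chosen[:, slot] == e
                if picked.any():
                    out[picked] += weights[picked, slot, None] * expert(x[picked])
        return out

tokens = torch.randn(16, 64)        # 16 toy token embeddings
print(MoELayer()(tokens).shape)     # -> torch.Size([16, 64])
```

In a real model each token only ever touches a few experts, which is why a network with hundreds of billions of parameters can cost roughly as much to run per token as a far smaller dense one.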
Another thing that's been shown to work really well: you could have this giant model and only activate certain parts of it, but that still requires quite a lot of infrastructure. What if instead you had a much smaller model, and you used the giant one to teach it what to do? That's becoming really useful, and people are already taking these giant models and using them to train smaller ones. This is a process called distillation. The idea is you take your 670 billion parameter model, ask it a bunch of questions in a certain field, and use those answers to train a smaller model to do the same thing. Often it will work, because a lot of the parameters weren't needed or were solving some other task, so you can distil that problem-solving ability into a small model. You can get pretty decent performance from an 8 billion parameter model, which will run on standard hardware; I can run an 8 billion parameter model on my computer, and you'll get close to the same kind of performance, certainly enough for most use cases, at least in a restricted field. You could actually get pretty good general performance, but if you have a specific goal in mind it's very useful for that. I should say they're actually just porting the general model, so they're not necessarily restricting it to a certain field, but your performance will probably be better if you do. Why use 670 billion parameters and all those GPUs when you can just use 8 billion and run it on your 4090? (Nice, I'll do that.)
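Distillation is simple enough to sketch in a few lines. Everything below is a hypothetical stand-in (the function names are invented for illustration and this is not DeepSeek's pipeline): the big teacher answers a pile of questions once, and the small student is then fine-tuned on those question-answer pairs with the same next-token loss shown earlier.

```python
# Sketch of distillation: use a big "teacher" model's answers as training data
# for a small "student". teacher_generate / finetune_student are hypothetical
# placeholders, not a real API; in practice you would call a large model's
# generate() and then run ordinary next-token training on the outputs.
def teacher_generate(question: str) -> str:
    """Placeholder for querying the big (e.g. 670B-parameter) model."""
    return f"worked answer to: {question}"

def finetune_student(pairs):
    """Placeholder: fine-tune an ~8B model on (prompt, response) pairs
    with the same next-token cross-entropy loss as ordinary training."""
    print(f"fine-tuning student on {len(pairs)} distilled examples")

questions = ["What is 17 * 24?", "Factorise x^2 - 5x + 6"]   # in practice: many thousands
distilled = [(q, teacher_generate(q)) for q in questions]     # teacher does the hard work once
finetune_student(distilled)
```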
The other thing they've done is make a lot of mathematical savings in the number of computations you have to do to go forward through the network. We won't spend too much time on this; maybe there's another video where we go into it in more detail. But essentially, one of these networks might involve thousands and thousands of very large matrix multiplications, where each of your matrices is 2,000 or 4,000 by 4,000, and that is just a huge amount of computation even for a fast machine, very expensive to do. There are ways to make this more efficient, and DeepSeek aren't the only ones who have come up with them; lots of people are researching this, and we're starting to see networks that can be trained at a fraction of the cost because internally their parameters are used more efficiently.
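Just to put rough numbers on that (these are illustrative figures, not DeepSeek's): multiplying two 4,096-by-4,096 matrices takes on the order of 2n³ floating-point operations, and a big network does thousands of such products every time it runs forward, so anything that shrinks the per-multiplication cost multiplies through the whole bill.

```python
# Back-of-the-envelope arithmetic for why the matrix multiplications dominate the cost.
# The sizes and counts are illustrative, not DeepSeek's actual figures.
n = 4096                          # a "4,000 by 4,000" matrix
flops_one_product = 2 * n**3      # multiplying two n x n matrices ~ 2n^3 floating-point ops
products_per_pass = 1000          # stand-in for "thousands" of products in one forward pass
print(f"{flops_one_product:,} FLOPs per product")                 # ~137 billion
print(f"{flops_one_product * products_per_pass:,} FLOPs per pass") # ~137 trillion, and training needs many passes
```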
So that's V3. But that's actually not what people are most excited about; it's really good, and training a network for this amount of money is hugely impressive, but there are a few other things they've done since then that are also quite special. R1 is the latest model, and R1 performs something called chain of thought. So what is chain of thought? It's something you might have seen if you've ever spoken to OpenAI's o1. Imagine I ask you to solve a long division problem, and I ask you just to say the number back to me. That's going to be difficult for a big long division. What you would do is write down the steps on a piece of paper and then solve the problem based on those steps, working your way through it. The observation behind chain of thought is the same kind of idea. It's quite difficult to ask a large language model "what is the solution to this logic problem?" or "what is the solution to this mathematical derivation?" and have it just spit an answer out, because that's not trivial to do; there are steps it's trying to skip over. So what chain of thought does is essentially write down a step-by-step process for solving the problem, slowly solve it, and then write down the answer. Then you can hide the chain of thought, or show it, as you see fit, but you tend to get much better at solving problems that require multiple steps. If you just ask "why is the sky blue?", it will regurgitate that pretty easily from text it learned on the internet, but if you're asking for problem-solving, that's hard to do in one shot, so it takes a little bit of time to work through it. This essentially adds computational cost during inference, but with the benefit that performance goes up. Whether that cost is worth paying probably depends on the questions you're asking of it, but that's the idea.
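As a small illustration of the "hide it or show it" point: the monologue typically arrives in the same text stream as the answer, marked off so a chat interface can strip it out. The tag format and helper below are an assumed example of that pattern, not a fixed specification.

```python
# A sketch of what "hide or show the chain of thought" means in practice.
# Reasoning models of this kind emit the monologue before the final answer;
# the <think> tag format here is illustrative of that pattern.
import re

raw_completion = (
    "<think>17 times 24: 17*20 is 340, 17*4 is 68, 340 + 68 is 408.</think>"
    "17 x 24 = 408."
)

def split_reasoning(text):
    match = re.search(r"<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return None, text.strip()          # model answered without a visible monologue

reasoning, answer = split_reasoning(raw_completion)
print("ANSWER:", answer)               # what a user normally sees
print("CHAIN OF THOUGHT:", reasoning)  # optional: show the working
```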
Now, OpenAI pioneered this chain of thought, but they don't tell you how they do it, because it's all closed (so it's not "open" AI at all, in some sense). Essentially you see a kind of précis, a summary version of the chain of thought, but not the actual internal monologue, which is essentially a trade secret. What R1 is doing is a chain of thought similar to o1, but it's fully public: they've released all the models, they've released all the code, you can talk to it, you can see the entire monologue, and they've also trained it with massively more limited data.

So how would you train this if you were OpenAI or a big tech company? What you would do is create a data set that says: here's a question, here's the chain of thought you should have produced for this question, and here's the answer. And you have to produce tens of thousands or hundreds of thousands of examples of this kind of thing, things like simple maths problems.
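A single training example in that supervised approach might look roughly like this; the field names and the example itself are hypothetical, just to show that someone has to hand over the intermediate working as well as the answer:

```python
# One hypothetical record from a supervised chain-of-thought data set.
# The "reasoning" field has to be written out (or generated) for every example,
# which is what makes this style of data so expensive to collect.
example = {
    "question": "A train travels 120 km in 1.5 hours. What is its average speed?",
    "reasoning": "Speed is distance divided by time. 120 km / 1.5 h = 80 km/h.",
    "answer": "80 km/h",
}
```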
The question I often ask, which things like ChatGPT have failed at before, is: suppose you have a stack of three boxes, red, blue and yellow. The red is on the bottom, on top of that is the blue, and on top of that is the yellow. You take the blue one out and put it on the top, and then you add a fourth, green box on top. Can you describe that stack of boxes? (The answer is "can you say it again, because I need to write that down.") I've asked ChatGPT this and it would often fail at it, until we got to the new reasoning models, the new chain-of-thought models, which start to perform a lot better on this kind of task. We can actually see that happening here on DeepSeek-R1: if we look at the text, it has started discussing with itself how to solve the problem, it comes up with some steps, it goes through them, and it finally produces the correct answer at the bottom. You don't have to look at the chain of thought if you don't want to, but from a research point of view it's quite interesting, and the fact that it's open is really positive. This is a problem that would have been hard for the model to solve without that chain of thought, because it would have had to look at the colours and move them straight into the output, and that would have been very hard. The chain of thought is what makes problems like this a bit more tractable.

To actually train it to solve that kind of problem, what you would do is give it a bunch of box-stacking problems, a bunch of derivations, a bunch of solutions and step-by-step examples, with the answers at the bottom, and over time it would learn to reproduce that performance. What R1 does is turn this on its head a bit and train using only the answers, which is hugely easier because you need much less data: you don't need to have crafted clever internal monologues; the internal monologue comes out of the training process, which is super cool. So the way it works is you give it a question and then you reward it.
Reinforcement learning is this idea where the model doesn't directly observe the correct answer; it gets a reward, or doesn't, based on whether it was right. Say you want to train a robot to walk: you don't say "move your left leg this way"; what you do is reward it for getting a bit further, and over time it might learn to walk. That's what they've done here: they've given it a load of maths problems and a load of maths answers, but they don't show it the answers; they just tell it whether it was right or wrong, and they also give it a small reward for having written some kind of internal monologue in the correct format. Over time, as it trains, the monologue gets better and better, and in the end it has a chat with itself and then solves the problem.
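The reward signal described here is simple enough to sketch. The function below is a toy stand-in (the tag format and the scores are assumptions for illustration; the actual reward and RL algorithm in the paper are more involved): it gives credit for getting the final answer right, plus a smaller bonus just for producing a monologue in the expected format. The key point is that no hand-written chain of thought is needed, only the answer.

```python
# Toy sketch of an answer-plus-format reward for reinforcement learning.
# Illustrative only; the real pipeline and exact scores are more sophisticated.
import re

def reward(model_output: str, correct_answer: str) -> float:
    score = 0.0
    # Small bonus for a well-formed monologue: <think>...</think> followed by an answer.
    if re.fullmatch(r"<think>.+</think>.+", model_output, re.DOTALL):
        score += 0.1
    # Main reward: does the final part contain the correct answer?
    final = re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL)
    if correct_answer in final:
        score += 1.0
    return score

print(reward("<think>340 + 68 = 408</think>The answer is 408.", "408"))  # 1.1
print(reward("The answer is 412.", "408"))                                # 0.0
```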
The really nice thing about this is that it's just much easier for someone like me to train a model this way, because there are lots of data sets with questions and answers in them; there are far fewer data sets with really nice step-by-step instructions on how to solve the problems (I don't know how to solve the problems either, by the way). And they've released it all open source, so it's much, much easier to do. Two weeks ago, OpenAI was maybe one of the only companies that could do something like this, along with a handful of others, and suddenly now you can kind of do it at home. To train one of these models from scratch you may still need lots and lots of graphics cards, but massively fewer than before, so a big organisation, let's say the size of this university, could quite comfortably do this now, as opposed to it being totally out of reach, which is a bit of a game changer. It's worth noting I'm skipping over some of the cool stuff they do; there's lots in the paper, which we can link in the description. There are loads of other ways they train it: they do a multi-stage training process, not just reinforcement learning, to make it work a little bit better and make the output more aesthetically pleasing. But in principle, what they've done is release a very performant model and tell us exactly how they did it, which is very unusual for these kinds of models, and, in my opinion, a good thing.

This has sent Silicon Valley into a bit of a spin, hasn't it? Yes.
Which, from my point of view as someone not in Silicon Valley, is quite funny sometimes. I think this is sometimes pitched as a bit of an arms race between different countries or different companies; I think it's only an arms race because they make it an arms race. The rest of us are just cracking on with our regular research, and that's true of most people. But if you have a company whose whole business model is built around having the best model, which no one knows how it works or can copy, this really hurts that business model, because now there's a good model that everyone knows how it works and can train themselves. That's a huge problem. The other problem is if you're a company like Nvidia, whose stock price is almost entirely based on the fact that these big companies buy hundreds of thousands of incredibly expensive GPUs because that's what's required to get the best performance, and then someone comes along and gets the best performance with essentially consumer hardware; that is also not a good look. Now, it might be that the companies with loads of GPUs still have an advantage for a while, but it's a leveller: over time people can do stuff with more limited hardware, and I think that's a great thing. We have access here to dozens of GPUs, and they're decent, and they're expensive, but they're nowhere near the same league as some of these companies, so we essentially couldn't try those things; we did other things instead. But now we can, and we might. I might still not, but I might, and it's an option for me. So I think this is going to level the playing field very quickly, and once something like this starts, lots of other companies will come up with new models and new efficiency savings, and that will only increase. We could be seeing the end of closed-source AI, because it may just not be viable.

If you just keep adding more and more data, or bigger and bigger models, or a combination of both, ultimately you will move beyond just recognising cats and you'll be able to do anything. That's the idea: you show enough cats and dogs, and eventually the elephant is just implied.