DeepSeek-R1 Crash Course

freeCodeCamp.org
Learn how to use DeepSeek-R1 in this crash course for beginners. Learn about the innovative reinforc...
Video Transcript:
hey, this is Andrew Brown, and in this crash course I'm going to show you the basics of DeepSeek. First we're going to look at the DeepSeek website, where you can use it just like you'd use ChatGPT. After that we'll download it using Ollama and get an idea of its capabilities there. Then we'll use another tool called LM Studio, which will allow us to run the model locally but with a bit of an agentic behavior. We're going to use an AI PC and also a modern graphics card, my RTX 4080. I'm
going to show you some of the skills around troubleshooting with it — we do run into issues with both machines — but it gives you an idea of what we can use with DeepSeek and where it's not going to work. I also show you how to work with it on Hugging Face with Transformers to do local inference. So hopefully you're excited to learn that, but we'll have a bit of a primer just before we jump in, so we know what DeepSeek
is, and I'll see you there in one second. Before we jump into DeepSeek, let's learn a little bit about it. DeepSeek is a Chinese AI company that creates open-weight LLMs — that's its proper name, which I cannot pronounce. DeepSeek has many open-weight models: we have R1, R1-Zero, DeepSeek V3, Math, Coder, MoE (mixture of experts), and DeepSeek V3 itself is a mixture-of-experts model. I would tell you more about those, but I never remember what they are — they're somewhere in my GenAI
Essentials course. The one we're going to be focusing on is mostly R1. We'll look at V3 initially, because that is what's used on deepseek.com, and I want to show you the AI-powered assistant there. But let's talk more about R1, and before we can talk about R1 we need to know a little bit about R1-Zero. There is a paper where you can read all about how DeepSeek works, but DeepSeek R1-Zero is a model trained via large-scale reinforcement learning without supervised fine-tuning, and it
demonstrates remarkable reasoning capabilities. R1-Zero has problems like poor readability and language mixing, so R1 was trained further to mitigate those issues, and it can achieve performance comparable to OpenAI o1. They have a bunch of benchmarks across the board, and the one in blue is DeepSeek; you can see OpenAI's there too, and most of the time they're suggesting that DeepSeek is performing better. I need to point out that DeepSeek R1 is just text generation — it doesn't do anything else — but it supposedly does
really, really well. They're probably comparing the 671-billion-parameter model, the model that we cannot run but maybe large organizations can afford at a reasonable rate. The reason DeepSeek is such a big deal is that it's speculated to have a 95 to 97% reduction in cost compared to OpenAI. That is the big deal here, because training and running these models costs many millions — even hundreds of millions — of dollars, and they said they trained and built this
model with about $5 million, which is nothing compared to the others. With all the talk about DeepSeek R1, we saw chip manufacturers' stocks drop, because companies are asking why they need all this expensive compute when clearly these models can be optimized further. So we're going to explore DeepSeek R1, see how we can get it to run, see where we can get it running, and where we're going to hit the limits with it. I do want to talk about what hardware
I'm going to be utilizing, because it really is dependent on your local hardware. We could run this in the cloud, but it's not really worth doing that — you really should be investing some money into local hardware and learning what you can and can't run based on your limitations. What I have is an Intel Lunar Lake AI PC dev kit; its proper name is the Core Ultra 200V series, and it came out in September 2024. It is a mobile chip, and the chip is special because it has an iGPU,
an integrated graphics unit — that's what the LLM is going to use. It also has an NPU, which is intended for smaller models, but the iGPU is what I'm going to run it on. The other machine we're going to run it on is my Precision 3680 Tower workstation; I just got this station, it's okay. It has a 14th-generation Intel i9, and I have a GeForce RTX 4080. So I ran this model on both of them, and I would say that the dedicated graphics card did do better, because
they just generally do, but from a cost perspective the Lunar Lake AI PC dev kit is cheaper. You can't buy the one on the left-hand side, because this is something that Intel sent me, but there are equivalent kits out there if you just search for an AI PC dev kit — Intel, AMD, Qualcomm, they all make them. I just prefer to use Intel hardware, but whichever one you want to utilize — even the Mac M4 would be in the same kind of line of these things that you could
utilize. I found that we could run about a 7 to 8 billion parameter model on either, but there were cases where, when I used specific things and the models weren't optimized and I didn't tweak them, it would literally hang the computer and shut it down — both of them, right, both of them. So there is some finessing here and understanding how your hardware works, but if you want to run this stuff you would probably want a computer on your network — like my AI PC is on my network — or
you might want a dedicated computer with multiple graphics cards. I kind of feel like if I really wanted decent performance I'd probably need two AI PCs with the LLM distributed across them with something like Ray Serve, or another graphics card in a distributed setup, because just having one of either feels like a little too little. But you can run this stuff and get some interesting results, and we'll jump into that right now. Okay, so before we try to work with DeepSeek programmatically,
let's go ahead and use deepseek.com's AI-powered assistant. This is supposed to be the equivalent of ChatGPT, Claude Sonnet, Mistral 7B, Llama, Meta AI. As far as I understand, this is completely free. It could be limited in the future, because this is a product coming out of China, and for whatever reason it might not work in North America at some point, so if it doesn't work you'll just skip on to the other videos in this crash course, which will show you how to programmatically download the open-source model and
run it on your local compute. But this one in particular is running DeepSeek V3, and then up here we have DeepSeek R1, which they're talking about, and that's the one that we're going to try to run locally. DeepSeek V3 is going to be more capable, because there's a lot more stuff moving around in the background there. So what we'll do is click Start Now. Now, I got logged in right away because I connected with my Google account — that is something that's really, really easy to do,
and the use case that I like to test these things on: I created this prompt document for helping me learn Japanese. Basically what this prompt document does is tell it, you are a Japanese language teacher and you are going to help me work through a translation. I have versions I did for Meta AI, Claude, and ChatGPT, so we're just going to take this one and try to apply it to DeepSeek — the one that's most advanced is the Claude one, and here you can
click into it and see I have a role, I have a language, I have teaching instructions, we have agent flow so it's handling state, we're giving it very specific instructions, and we have examples. Hopefully what I can do is give it these documents and it will act appropriately. This is in my GitHub and it's completely open for you to access, at omenking/free-genai-bootcamp-2025 in the sentence-constructor folder. What I'm going to do — I'm in GitHub and I'm logged in, but
if I press period this will open it up in the web editor; I'm just opening it from github.com. What I did is, over time, I made it more advanced, and the Claude one is the one that we really want to test out. So I have these, and I want this one here — this is a teaching test, that's fine — I have examples, and I have consideration examples. Okay, I'm just carefully reading this, trying to decide which ones I want — I actually want almost all of these — I'm just going
to download the folder so I'm going to do I'm going to go ahead and download this folder I'm going to just download this to my desktop okay and uh it doesn't like it unless it's in a folder so I'm going to go ahead and just hit download again I think I actually made a folder on my desktop called No Maybe not download but we'll just make a new one called download okay I'm going to go in here and select we'll say view save changes and that's going to download those files to there so if I
go to my desktop here I go into download we now have the same files okay so what I want to do next is I want to go back over to deep seek and it appears that we can attach file so it says text extraction only upload docs or images so it looks like we can upload multiple documents and these are very small documents and so I want to grab this one this one this one this one and this one and I'm going to go ahead and drag it on in here okay and actually I'm going
to take out the prompt.md, and I'm actually just going to copy its contents in here, because the prompt.md tells it to look at those other files. So we go ahead and copy this, we'll paste it in here, we hit enter, and then we'll see how it performs. Another thing we should check is its vision ability, but we'll go here — and it says, let's break down the sentence example for sentence structure. Looks really, really good. So, next possible answers, try formatting the first clue. I'm going to try to tell it to give me
the answer — just give me the answer — I want to see if I can subvert my own instructions. Okay, and so it's giving me the answer, which it's not supposed to be doing. Did I tell you not to give me the answer in my prompt document? Let's see if it knows: my apologies for providing the answer. Clearly it's already failed on that, but I mean, it's still really powerful, and the consideration is that even if it's not as capable as Claude or ChatGPT, there's the cost factor.
But it really depends on what these models are doing, because when you look at Meta AI, or you look at Mistral 7B, those models are not necessarily working with a bunch of other models, and so there might be additional steps that Claude or ChatGPT is doing — like making sure it actually reads your documents. So far, right, I ran it on those ones as well, but those are equivalents of simpler ones that don't
do all those extra checks, so it's probably fairer to compare it to something like Mistral 7B or Llama in terms of its reasoning. Here you can see it already made a mistake, but we were able to correct it, and still, this is pretty good. So that's fine, but let's go test its vision capabilities, because I believe that it does have vision capabilities. I'm going to go ahead and look for some kind of image, so I'm going to search for Japanese text — I'm going to go to images here, and
we'll say Japanese menu, in Japanese. Even if you don't care about Japanese, it's a very good test language, as the model really has to work hard to figure it out. So I'm trying to find a Japanese menu in Japanese. Maybe we'll just go to a Japanese website, so we'll say Japanese hotel, or — you know what's better — we'll say Japanese newspaper, that might be better. And so this is probably the Mainichi, okay, and
I want it actually in Japanese — that's the struggle here today. I'm looking for the Japanese version, I don't want it in English. Let's try this: the Japan Times (.jp) — I do not want it in English, I want it in Japanese, and so I'm just looking for that here, just give me a second. Okay, I went back to this first one; in the top right corner it says Japanese, so I'll click this, and now we have some Japanese text. Now, if this model was built by China, I would imagine
that they're probably really good with Chinese characters — and Japanese borrows Chinese characters — so it should perform really well. So what I'm going to do: I have no idea what this is about, but we'll go ahead and grab this image here. Now that it's there, I'm going to go back over to DeepSeek, start a new chat, and paste this image in, and I'm going to say: can you transcribe the Japanese text in this image? Because this
is what we want to find out: can it do this? Because if it can do that, that makes it a very capable model — and transcribing means extracting out the text. Now, I didn't tell it to produce the translation; it says this text discusses the scandal involving a former talent, etc. etc. You know — can you translate the text and break down the grammar? What we're trying to do is say break it down so we can see what it says. The formatting is not the — oh, here we go, this
is what we want. So just carefully looking at this: possessive, advancement, to ask a question, voices — yeah, it looks like it's doing what it's supposed to be doing. So yeah, it can do vision, and that's a really big deal — but this is V3, and that makes sense, because this is DeepSeek's hosted one. The question will be what we can actually run locally, as there have been claims that this thing does not require serious GPUs, and I have the hardware to test that out on, so we'll do that in the next
video. This was just showing you how to use the AI-powered assistant, if you didn't know where it was. Okay, all right, so in this video we're going to start learning how to download the model locally — because imagine if DeepSeek is not available one day for whatever reason — and again, it's supposed to run really well on computers that do not have expensive GPUs, and that's what we're going to find out here. The computer that I'm on right now — I'm actually remoted in, like I'm connected over my network
to my Intel developer kit. This thing, if you bought it brand new, is between $500 and $1,000, but the fact is that this is a mobile chip. I call it Lunar Lake, but it's actually called the Core Ultra 200V series mobile processor, and this is the kind of processor that you could imagine will be in your phone in the next year or two. What's so special about these new types of chips is that when you think of having a chip, you
just think of CPUs, and then you hear about GPUs being an extra graphics card — but these things have a built-in graphics card called an iGPU, an integrated graphics card, plus an NPU, a neural processing unit, and just a bunch of other capabilities. So basically they've crammed a bunch of stuff onto a single chip, and it's supposed to allow you to download and run ML models. This is something you might want to invest in; you could probably do this on a Mac M4 as
well, or some other things, but this is just the hardware that I have, and I do recommend it. Anyway, one of the easiest ways that we can work with the model is by using Ollama. Ollama is something I already have installed — you just download and install it, and once it's installed it usually appears over here, and mine is over here. But the way Ollama works is that you have to do everything via the terminal, so I'm on Windows 11 here, I'm going to open up a terminal; if you're on a Mac,
same process, you open up a terminal. Now that I'm in here I can type the word ollama — okay, so Ollama is here, and if it's running it shows a little Ollama icon somewhere on your computer. What I want to do is go over to the Ollama site, and you can see it's showing us R1, but notice there's a dropdown, and we have 1.5 billion, 7 billion, 8 billion, 14 billion, 32 billion, 70 billion, and 671 billion. So when they're talking about DeepSeek R1 being as good as ChatGPT, they're
usually comparing the top one the 671 billion parameter one which is 404 GB I don't even have enough room to download this on my computer and so you have to understand that this would require you to have actual gpus or more complex setups I've seen somebody um there's a video that circulates around that somebody bought a bunch of mac Minis and stack them let me see if I can find that for you quickly all right so I found the video and here is the person that is running they have 1 two three three four five
six, seven — seven Mac minis, and it says they're running DeepSeek R1. You can see that it says M4 Mac minis, and it says total unified memory 496 GB — so that's a lot of memory, first of all. And it is kind of using GPUs, because these M4 chips are just like the Lunar Lake chip that I have, in that they have integrated graphics units and they have NPUs, but you can see that they need a lot of them. So if you have a bunch of these you can technically run it, and
again, whatever you want to invest in, you really only need one of these — whether it's the Intel Lunar Lake, or the Mac M4, or whatever AMD Ryzen's equivalent is. But the point is, even if you were to stack them all, network them together, and do distributed compute — which you'd use something like Ray, Ray Serve, to do — you'll notice, look at the typing speed: it is not fast, it's like clunk, clunk, clunk,
clunk. So understand that you can do it, but you're not going to get that from home unless the hardware improves or you buy seven of these. That doesn't mean we can't run some of the other models, right, but you do need to invest in something like this thing and then add it to your network, because buying a graphics card means you then have to buy a whole computer, and it gets really expensive — so I really do believe in AI PCs. But we'll go back over to here, and so
we're not running this one — there's no way we're able to run this one — but we can probably easily run the seven billion parameter one; I think that one is doable, and we definitely can do the 1.5 billion one. So this is really what we're targeting, probably the seven billion parameter model. To download it, all I have to do is copy this command here — I already have Ollama installed — and what it's going to do is download the model for me, so it's now pulling it, probably from Hugging Face.
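For reference, that copied command is just Ollama's run command with the size tag from the dropdown — roughly:

```sh
# pulls the weights the first time (a few GB for the 7B distill), then drops into a chat prompt
ollama run deepseek-r1:7b
```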
Okay, so we go to Hugging Face and we search for DeepSeek R1 — that's where it's grabbing it from, Hugging Face, and it's probably this one. There are some variants under here which I'm not 100% certain about, but you can see there are distills of other models underneath, which is kind of interesting. This is probably the one being downloaded right now, at least I think it is, and normally what we're looking for here is these safetensor
files and we have a bunch of them so I'm not exactly sure we'll figure that out here in a little bit but the point is is that we are downloading it right now if we go back over to here you can see it's almost downloaded so it doesn't take that long um but you can see they're a little bit large but I should have enough RAM on this computer um I'm not sure how much this comes with just give me a moment so uh what I did is I just open up opened up system information
and then down below here it's it's saying I have 32 GB of RAM so the ram matters because you have to have enough RAM to hold this stuff in memory and also if the model's large you have to be able to download it and then you also need um the gpus for it but you can see this is almost done so I'm just going to pause here until it's 100% done and it should once it's done it should automatically just start working and we'll we'll see there in a moment okay just showing that it's still
pulling so um it downloaded now it's pulling additional containers I'm not exactly sure what it's doing but now it is ready so it didn't take that long just a few minutes and we'll just say hello how are you and that's pretty decent so that's going at an okay Pace um could I download a more um a more intensive one that is the question that we have here because we're at the seven billion we could have done the 8 billion why did I do seven when I could have done eight the question is like where does
it start kind of chugging it might be at the 14 14 billion parameter model we'll just test this again so hello and just try this again but you can see see that we're getting pretty pretty decent results um the thing is even if you had a smaller model through fine-tuning if we can finetune this model we can get better performance for very specific tasks if that's what we want to do but this one seems okay so I would actually kind of be curious to go ahead and launch it I can hear the computer spinning up
from here — the Lunar Lake dev kit. But I'm going to go ahead and just type /bye, and then I want to delete that model, so I'm going to say remove — it was deepseek-r1. First let's list the models here, because we want to be cautious of the space that we have, and this model is great, I just want to run the 8 billion parameter one or something larger, so we'll say remove this.
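The housekeeping commands I'm using here look roughly like this (the tag depends on which size you pulled):

```sh
/bye                        # typed inside the chat session to exit it
ollama list                 # show downloaded models and their sizes
ollama rm deepseek-r1:7b    # delete the 7B to free up disk space
ollama run deepseek-r1:14b  # pull and run the 14B instead
```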
Okay, it's deleted, and I'm pretty confident it can run the 8 billion. Let's do the 14 billion parameter one — this is where it might struggle — and the question is how large is this: it's 10 GB, and I definitely have room for that. So I'm going to go ahead and download this one, and once we have that we'll decide what we want to do with it. Okay, we're going to go ahead and download that, and I'll be back when it's done downloading. All right, so we now have this model
running, and I'm just going to go ahead and type hello — and surprisingly it's doing okay. Now, you can't hear it, but as soon as I typed I could hear my little Intel developer kit going. I just want you to know, if you were to buy an AI PC — the one that I have is not for sale, but if you look one up, it has a Lunar Lake chip in it, that Core Ultra 200V or whatever — if you just find
it with another provider, like with Asus or whoever Intel is partnered with, you can get the same thing; it's the same hardware in it. Intel just doesn't sell them direct, they always do it through a partner. But you can see here that we can actually work with it. I'm not sure how long this would work for — it might quit at some point — but at least we have some way to work with it. So Ollama is one way that we can get this model, but obviously there are
different ones like DeepSeek R1. I'm going to go back to Ollama here, and I just want to delete that model now because we're done here. But there's another way we can work with it — I think it's called Notebook LM or LM Studio — which we'll do in the next video, and that will give you more of an AI-powered assistant experience, so not necessarily working with it programmatically, but closer to the end result that we want. I'm not going to delete the model just yet here, but if
you want to, I've already showed you how to do that. We're going to look at the next one in the next video, because it might require you to have Ollama as the way that you download the model, but we'll go find out. Okay, so see you in the next one. All right, so here we're at LM Studio — I've actually never used this product before; I usually use Open WebUI, which hooks up to Ollama, but I've heard really good things about this one, so I figured we'll just
go open it up and see if we can get a very similar experience to a ChatGPT-like experience. Here they have downloads for Mac — the M series, which are the latest ones — Windows, and Linux. You can see here that they're suggesting you want one of these new AI PC chips, as is usually the case; if you have GPUs then you can probably use GPUs. I actually do have really good GPUs — I have an RTX 4080 here — but I want
to show you what you can utilize locally. So what we'll do is just wait for this to download, and now let's go ahead and install it. I'm really curious how we are going to plug this in — like, how are we going to download the model? Does it plug into Ollama, does it download the model separately? That's what we're going to find out shortly, when it's done installing, so we'll just wait a moment. All right, so now we have the "completing the LM Studio setup" screen — LM
Studio has been installed on your computer, click finish — so we'll go ahead and hit finish. This will just open up here; we'll give it a moment to open. I think in the last video we stopped Ollama, so even if it's not there — I'm just going to close it out here again; it might require Ollama, we'll find out in a moment. It says get your first LLM, and here it suggests Llama 3.2 — that's not what we want, so we're going to go down below here, where it says
enable local LLM service on login. It sounds like what we need to do is log in here and make an account — I don't see a login, I don't — so we'll go back over to here, and they have this onboarding step, so I'm going to skip onboarding and see if we can figure out how to install this, just a moment. I'm noticing at the top here we have "select a model to load" — no LLMs yet, download one to get started. I mean, yes, Llama 3.1 is cool,
but it's not the model that I want, right — I want that specific one — and so this is what I'm trying to figure out. In the bottom left corner we have some options here, and I know it's hard to read, I apologize, but there's no way I can make the font larger, unfortunately. They have lmstudio.ai, so we'll go over there; I'm going to go to the model catalog, and we're looking for DeepSeek. We have DeepSeek Math 7 billion, which is fine, but I just want the normal DeepSeek model; we have DeepSeek Coder V2, so that'd be cool if we wanted to do some coding; and we have distilled ones — R1 distilled, so Llama 8 billion distilled and Qwen 7 billion. I would think we probably want the Llama 8 billion distill. Here it says "use in LM Studio", so I'm going to go ahead and click it, and we'll click open. Now it's going to download it — 4.9 gigabytes — so we'll go ahead and do that; the model is now downloading, so we'll wait for
that to finish. Okay, so it looks like we don't need Ollama at all — this is all-inclusive. One thing I do want to point out: notice that it has a GGUF file, and that makes me think it is using whatever llama.cpp can use — I believe llama.cpp is what's compatible with GGUF — and same thing with Ollama, so they might be sharing the same stuff, because they're both using GGUF files.
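That shared GGUF format is llama.cpp's file format, and llama.cpp is the engine both of these tools build on, so you could also run the same file straight from llama.cpp's CLI — something like this, where the file name is just a hypothetical quantized download:

```sh
llama-cli -m ./DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -p "Hello, how are you?"
```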
This is still downloading, but while I'm here I might as well talk about what a distilled model is. You'll notice it's saying R1 distilled Llama 8 billion or Qwen 7 billion parameter. Distillation is where you take a larger model's knowledge and do knowledge transfer to a smaller model, so it runs more efficiently but keeps much of the same capability. The process is complicated — I explain it in my GenAI Essentials course, which this part of this crash course will probably get rolled into later on — but basically it's a technique to transfer that knowledge, and there are
a lot of ways to do it, so I can't summarize it all here, but that's why you're seeing distilled versions of those things. Basically they've figured out a way to take the knowledge — maybe they're querying directly; that's probably what they're doing: they have a bunch of evaluation-style queries that they hit Llama or these other models with, then they look at the results, and when they get their smaller model to do the same thing, it performs just as well.
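To make that concrete, here's a rough sketch of the response-based idea I just described: a big teacher model answers a pile of prompts, and the small student is then fine-tuned on those pairs. The model names are placeholders and this is illustrative only — DeepSeek's actual distillation pipeline is more involved than this.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_id = "some-large-teacher-model"  # placeholder, not a real repo id
tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id, device_map="auto")

prompts = ["Explain why the sky is blue.", "Translate 'good morning' into Japanese."]
pairs = []
for p in prompts:
    ids = tok(p, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**ids, max_new_tokens=256)
    pairs.append({"prompt": p, "completion": tok.decode(out[0], skip_special_tokens=True)})

# The smaller student model is then fine-tuned with ordinary next-token
# cross-entropy on these (prompt, completion) pairs (e.g. with an SFT trainer),
# so it learns to imitate the teacher's outputs.
```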
so the model is done we're going to go ahead and load the model and so now I'm just going to get my head a little bit out of the way cuz I'm kind of in the way here so now we have an experience that is more like uh what we expected to be and on the top here I wonder is a way that I can definitely bring the font up here I'm not sure if there is a dark mode the light Mode's okay but um a dark mode would be nicer but there's a lot of
options around here so just open settings in the bottom right corner and here we do have some themes there we go that's a little bit easier and I do apologize for the small fonts um there's not much I can do about it I even told it to go larger this is one way we can do it so let's see if we can interact with this so we'll say um can you um I am learning Japanese can you act as my Japanese teacher let's see how it does now this is R1 this does not mean that
it has Vision capabilities um as I believe that is a different model and I'm again I'm hearing my my computer spinning up in the background but here you can see that it's thinking okay so I'm trying to learn Japanese and I came across the problem where I have to translate I'm eating sushi into Japanese first I know that in Japanese the order of subject can be this so it's really interesting it's going through a thought process so um normally when you use something like web UI it's literally using the model directly almost like you're using
it as a playground — but this one actually shows the reasoning built in, which is really interesting; I didn't know it had that. So there literally is agent-style thinking on display. This isn't specific to DeepSeek in LM Studio's UI — I think if we brought in any reasoning model it would do this — and it's showing us the reasoning it's doing as it works through this. We're going to let it think and wait till it finishes, but it's really cool to see its reasoning, where normally you wouldn't see this. So you know, when
ChatGPT says it's thinking, this is the kind of stuff it's actually doing in the background that it doesn't fully tell you. We'll let it work here and be back in just a moment. All right, so it looks like I lost my connection — this sometimes happens, because when you're running a computational task it can halt all the resources on your machine. This model was a bit smaller, but I was still running Ollama in the background, so what I'm going to do is go to my Intel machine — I can
see it rebooting in the background here — give it a moment to reboot, reconnect, make sure Ollama is not running, and then we'll try that again. So I'll be back in just a moment. You know what, it was the computer deciding to do Windows updates, so it didn't crash, but this can happen when you're working with LLMs — they can exhaust all your resources. So I'm going to wait till the update is done and get my screen back up here in just a moment. All
right so I'm reconnected to my machine I do actually have some tools here that probably tell me my use let me just open them up and see if anyone will actually tell me where my memory usage is yeah I wouldn't call that very uh useful maybe there's some kind of uh tool I can download so monitor memory usage well I guess activity monitor can just do it right um or what's it called see if I can open that up here try remember the hot key for it there we go and we go to task manager
and so maybe I just have task manager open here we can kind of keep track of our memory usage um obviously Chrome likes to consume quite a bit here I'm actually not running OBS I'm not sure why it um automatically launched here oh you know what um oh I didn't open on this computer here okay so what I'll do is I'll just hit task manager that was my task manager in the background there we go and so here we can kind of get an idea this computer just restarted so it's getting it itself in order
here and so we can see our mem us is at 21% that's what we really want to keep a track of um so what I'm going to do is go back over to LM Studio we're going to open it up but this is stuff that really happens to me where it's like you're using local LMS and things crash and it's not a big deal just happens but we came back here and it actually did do it it said thought for 3 minutes and 4 seconds and you can see its reasoning here okay it says the
translation of "I'm eating sushi" into Japanese — and it gives the Japanese sentence, which is right, and says the structure correctly places things. One thing I'd like to ask it: can it give me Japanese characters? So — can you show me the sentence in Japanese using Japanese characters, e.g. kanji and hiragana? Okay, so we'll go ahead and do that. It doesn't have a model selected, so we'll go to the top here — what's kind of interesting is that maybe you can switch between different kinds of models as you're working. Here we do have GPU
offload of discrete uh model layers I don't know how to configure any of these things right now um flash attention would be really good so decrease memory usage generation time on some models that is where a model is trained on flash attention which we don't have here right now but I'm going to go ahead I'm going to load the Llama distilled model and we're going to go ahead and ask if it can do this for us because that would make it a little bit more useful okay so I'm going to go ahead and run that
and we'll be back here in just a moment and we'll see the results all right we are back and we can take a look at the results here we'll just give it a moment I'm going to scroll up and you know what's really interesting is that um it is working every time I do this I it does work but the computer restarts and I think the reason why is that it's exhausting all possible resources um now the size of the model is not large it's whatever it is the 8 billion parameter one at least I
think that's what we're running here — it's a bit hard because it says 8 billion distilled, so we'd have to take a closer look at it; it says 8 billion, so it's 8 billion parameters. But the thing is, it's the reasoning happening behind the scenes, and I think that's what's exhausting things, whereas when we're using Ollama it's less of an issue. I think it might just be that LM Studio, the way it works, might not have ways of — or at least I don't know how
to configure it to make sure that it doesn't uh uh destroy destroy stuff when it runs out here because you'll notice here that we can set the context length and so maybe if I reduce that keep model in memory so Reserve System memory for the model even when offload GPU improves performance but requires more RAM so here you know we might toggle this off and get better production but right now when I run it it is restarting but the thing is it is working so you can see here it thought for 21 seconds it says
of course I'd like to help you and so here's some examples and it's producing pretty good code or like output I should say but anyway what we've done here is we've just changed a few options so I'm saying don't keep it in memory okay because that might be an issue and we'll bring the context window down and it says CPU uh thread to allocate that seems fine to me again I'm not sure about any of these other options we're going to reload this model okay so we're now loading with those options I want to try
one more time if my computer restarts it's not a big deal but again it might be just LM Studio that's causing us these issues here and so I'm just going to click into this one I think it's set up those settings we'll go ahead and just say Okay um so I'm going to just say like how do I ask how do I I say in Japanese um uh where is the movie theater okay it doesn't matter if you know Japanese it's just we're trying to tax it with something hard so here it's running again and
it's going to start thinking we'll give it a moment here and as it's doing that I'm going to open up task manager he and we'll give it a moment I noticed that it has my um did it restart again yeah I did so yeah this is just the experience again it has nothing to do with the Intel machine it's just this is what happens when your resources get exhausted and so it's going to restart again but this is the best I can de demonstrate it here now I can try to run this on my main
machine using the RTX 4080. That might be another option we can do, where I actually have dedicated GPUs, and this is a 14th-generation Intel chip — I think it's Raptor Lake — so maybe we'll try that as well in a separate video, just to see what happens. That was the example there, but I could definitely see how having more of those computers stacked would make this a lot easier; even if you had just a second one, that'd still be more cost-effective
than buying a completely new computer outright — those two, or smaller mini PCs. But I'll be back here in just a moment. Okay, so I'm going to get this installed on my main machine. My main machine, as I'm recording here, is using my GPU, so it's going to have to share it; I'm just going to stop this video, and then we're going to treat this one as LM Studio using the RTX 4080, and we'll see if the experience is the same or different. All right, so I'm back here, and
now I'm on my main computer, and we're going to use LM Studio. I'm going to skip the onboarding, and I remember there's a way for us to change the theme — maybe in the bottom right corner, the cog — and we'll change it to dark mode so our eyes have it a little easier. I also want to bump up the font a little bit. To select the model, I'm going to go here to select a model at the top — we do not want that model here,
so I'm going to go to — maybe here on the left-hand side, no, not there — it was here in the bottom left corner, and we're going to go to lmstudio.ai. We want to make our way over to the model catalog at the top right corner, and I'm looking for DeepSeek R1 Distill Llama 8B, so we click that and we'll say "use in LM Studio". That's now going to download this locally, so we are now going to download this model, and I'll be back here in just a moment. All
right so I've downloaded the model here I'm going to go ahead and load it and again I'm a little bit concerned because I feel like it's going to cause this computer to restart but because it's uh offloading to the gpus I'm hoping that'll be less of an issue but here you can see it's loading the model into memory okay and we really should look at our options that we have here um it doesn't make it very easy to select them but oh here it is right here okay so we have some options here and this
one actually is offloading to the GPU — you see it has GPU offload. I'm almost wondering if I should have set GPU offload on the AI PC, because it technically has an iGPU, and maybe that's where we were running into issues, whereas when we were using Ollama maybe it was already utilizing the GPU, I don't know. But anyway, what I want to do is ask the same thing, so I'm going to say: can you teach me Japanese for JLPT N5 level? So we'll go ahead and do that, we'll
hit enter, and again, I love how it shows us the thinking that it does here. I'm assuming that it's using the RTX 4080 that I have in this computer, and this is going pretty decently fast — it's not causing my computer to cry; this is very good, actually reasonably good, so yeah, it's performing really well. The question is, I'd like to go try the developer kit again, because I remember the GPUs were not offloading, right, so maybe it didn't
detect the iGPU — but this thing is going pretty darn quick here, so that was really, really good. It's giving me a bunch of stuff, so: okay, but give me example sentences in Japanese. That's what I want; we'll give it a moment — yep, and that looks good, so it is producing really good stuff. This model, again, is just the Llama 8-billion-parameter one. I'm going to eject this model, let's go back over into LM Studio here, and I want to go to
the model catalog, because there are other DeepSeek models. So we go and take a look at DeepSeek: we have Coder V2 — "the younger sibling of GPT-4, the DeepSeek Coder V2 model" — but that sounds like DeepSeek 2, right, so I'm not sure if that's really the latest one, because we only want to focus on R1. Yeah, I don't think we really care about those other ones, we only care about R1 models, but you can see we're getting really good performance. So the question is, what's the compute, or the
TOPS difference, between these two? Maybe we can ask the model ourselves, so I'm going to start a new conversation here and I'm going to say: how many TOPS — or is it TOPS, I think it's called TOPS — does the RTX 4080 have? We'll see if it can do it; select this model here, and yeah, we'll load the model and run that. We'll give it a moment, and while that's thinking — I mean, obviously we could just use Google for this, we don't really need to do that, but
I want to do a comparison to see how many TOPS they have. So I'll let that run in the background, and I'm also just going to search and find out very quickly — oh, here it goes: it does not have an officially specified number, as NVIDIA, the company, focuses on metrics like CUDA cores and memory bandwidth, so this would be speculative. Okay, but then how do I compare TOPS for, let's say, Lunar Lake versus the RTX 4080? And I know there are lots of ways to
do it, but if I can't compare it, how do I do it? While that's trying to figure it out, I'm going to go over to Perplexity, and maybe we can get an exact example, because I'm trying to understand how much my discrete GPU does compared to the one that's internal. So we'll say Lunar Lake versus RTX 4080 for TOPS performance, and we'll see what we get. So Lunar Lake has 120 TOPS, and since the RTX 4080 is aimed at gaming rather than AI workloads, NVIDIA doesn't typically advertise its TOPS —
it talks about things like maintaining 60 FPS. Okay, but then what could it be — how many TOPS could it be for the RTX 4080? That kind of makes it hard, because we don't know how many TOPS it is, so we don't know what kind of expectation we should have for it. Okay, fair enough — so we can't really compare, it's apples to oranges I guess, and it's just not going to give us the result here. But here it is going through a comparison: if you run MLPerf on the GPUs,
like with a model such as ResNet, you can directly compare them across architectures, and that's basically the only way to do it. So we can't — it's apples to oranges. I want to go and attempt to run this one more time on the Lunar Lake, and I want to see if I can set the GPUs; if we can't set the GPUs, then I think it's always going to have that issue specifically with this, but we will use the Lunar Lake for working with Hugging Face and other things like that,
so I'll be back in just a moment. All right, I'm back, and I just did a little bit of exploration on my other computer there, because I want to understand: okay, I have this AI PC, it's very easy to run this on my RTX 4080, but when I run it on the Lunar Lake it is shutting down, and I think I understand why. This, I think, is really important: when you're working on local machines you have to have a bit better understanding of the hardware. So I'm just
going to RDP back into this machine here, just give me a moment. Okay, I have it running again, and it probably will crash again, but at least I know why. There's a program called Core Temp, and what it does is let you monitor your — this is for Windows; for Mac I don't know what you'd use, probably just Activity Monitor — but here I can see that none of these CPU cores are being overloaded. This is just showing us the CPUs, though; if we open up Task Manager here,
okay, and now the computer is running perfectly fine, it's not even spinning its fans — if I go to the left-hand side here, we have CPUs, NPUs, and GPUs. Now, NPUs are the things we'd want to use, because an NPU is specifically designed to run models; however, a lot of the frameworks like PyTorch and TensorFlow were optimized on CUDA originally as the underlying framework, and so normally you have to go through an optimization or conversion format. I don't know at this time if there is a
converted format for Intel hardware, because DeepSeek is so new, but I would imagine that is something the Intel team is probably working on, and this is not just specific to Intel — whether it's AMD or whoever, they want to make optimizations to leverage their different kinds of compute, like their NPUs. It also has to do with the thing that we're using — we're using that thing over here; I'm not sure what all these little — oh yeah, this is just Core Temp showing us all the temperatures, and
so what we can do is just kind of see what's going on here. I'm going to bring this over so we can see what's happening. We'd want to use the NPU — that's not going to happen, because this thing is not set up to do that — but if I drop it down here and we click into this, we have our options. Before, we didn't have any GPUs, but we can go here and say use the GPUs; I don't know how much it can offload, but I'll
set it to something like 24. We have a CPU thread count — that might be something we want to increase — we can reduce our context window, and we might not want to load it into memory. But the point is that if it exhausts the GPU — because it's all a single integrated circuit — I have a feeling it's going to end up restarting. Here, again, you can see usage is very low; we'll go ahead and load the model, and the next thing I will do is type in
something like: I want to learn Japanese, can you provide me a lesson on Japanese sentence structure? Okay, we'll go ahead and do that. Actually, notice: if it doesn't require a long thought process, it works perfectly and doesn't cause any issues with the computer. We'll go ahead and run it, and let's pay attention to the left-hand side here: now we can see it's utilizing the GPU — it was at zero, it wasn't using the GPU at all, but notice it's at 50% now — and it's doing pretty good. Our CPU is higher than
usual; before, when I ran this earlier off screen, the CPU was really low and it was the GPU that was working hard. So again, you have to understand your settings as you go, but this is not exhausting things so far. We're just watching these numbers here, and also our core temps, and you can see we're not running into any issues — it's not even spinning up, it's not making any complaints right now. The other challenge is that I have a developer kit that's something they don't sell,
right so if there was an issue with the BIOS I'd have to update it and there's like no all I can get is Intel's help on it but if I to buy like a commercial version of this like um whoever is partnered with it if it's Asus or Lenovo or whatever I would probably have um less issues because they're maintaining those bios updates um but so far we're not having issues but again we're just monitoring here we have 46 47% 41% um again we're watching it you can see core is at 84% 89% and so
we're just carefully watching this stuff but I might have picked the perfect the perfect amount of settings here and maybe that was the thing is that you know I turned down the CPU like what did we do the options I turned the gpus down so I turned that down I also told it not to load memory and now it's not crashing okay there we go it's not as fast as the RTX 4080 um but you know what this is my old graphics card here I actually bought this uh not even long ago before I got
my new computer — this is an RTX 3060, okay, this is not that old, it's like a couple of years old, 2022, and I would say that when I used to use that and run models, my computer would crash. But the point is that these newer chips, whether it's again the M4 or the Intel Lunar Lake or whatever AMD's one is, have roughly the strength of graphics cards from two years ago, which is crazy to me. But anyway, I think I might have found the sweet
Spot I'm just really really lucky but you can see the memory usage here and stuff like that and you just have to kind of monitor it and you'll find out once you get those settings uh what works for you or you know you buy really expensive GPU and uh it'll run perfectly fine but here it's going and we'll just give it a moment we be back in just a moment okay anyway I was going a little bit slow so you know I just decided we'll just move on here but my my point was made clear
is that if you dial in the specific settings, you can make this stuff work on machines where you don't have a dedicated graphics card. If you have a dedicated graphics card, you can see it's pretty good — yeah, this is fine with the RTX 4080 — so if you have that, you're going to be in good shape. But now that we've shown how to do it with AI-powered assistants, let's take a look at how we can actually get these models from Hugging Face next and work with them programmatically. So I'll see
you in the next one. All right, so what I want to do in this video is see if we can download the model from Hugging Face and then work with it programmatically, because that's going to give you the most flexibility with these models. Of course, if you just want to consume them, then using LM Studio, which I showed you, would be the easiest way to do it, but having a better understanding of these models and how we can use them directly would be useful.
I think for the rest of this I'm just going to use the RTX 4080, because I realize that to really make use of AI PCs you have to wait till they have optimizers for them. Speaking of Intel again, you have this kit called OpenVINO, and OpenVINO is an optimization framework; if we go down, I think they have a bunch of examples here — we'll go back for a moment, yeah, quick examples, maybe over here, maybe not over here — but we go back to the notebooks and we scroll on
down — yeah, they have this page here. In this thing they will have different LLMs that are optimized specifically so that you can maybe leverage the NPUs, or make it run better on CPUs, but until that's out there we're stuck on the GPUs and we're not going to get the best performance that we can. So maybe in a month or so I can revisit that, and then I will be utilizing it — it might be as fast as my RTX 4080 — but for now we're going to just
stick with the RTX 4080, and we'll go look at DeepSeek, because they have more than just R1. You can see there is a collection of models, and in here, if we click into it, we have R1, R1-Zero — which I don't know what that is, let's go take a look, it probably explains it somewhere — but we have R1 distilled 70 billion parameter, Qwen 32 billion parameter, Qwen 14 billion, and so we have some variants here that we can utilize. Just give me a moment, I want to see what Zero
is. To me it sounds like Zero is the precursor to R1 — it's the model trained via large-scale reinforcement learning without supervised fine-tuning. So I don't think we want to use Zero; we want to use the R1 model or one of these distilled versions, which give similar capabilities. If we go over to here, it's not 100% clear how we can run this, but down below we can see total parameters is 671 billion — so this one literally is the big one, the really, really big one, and that
would be a little bit too hard for us to run on this machine; we can't run 671 billion parameters. You saw the person stacking all those Apple M4s — yeah, I have an RTX 4080, but I'd need a bunch of those to do it. Down below we have the distilled models, and this is probably what we were using when we were using Ollama, if we wanted to go ahead and do that there. So this is probably where I would focus my attention: these distilled models. When we're using Hugging Face,
it will show us how we can deploy the models up here. Notice over here we have vLLM — I covered this in my GenAI Essentials course, I believe — but there are different ways we can serve models: just as web servers have software underneath to serve them, so do these machine learning models, and vLLM is one that you want to pay attention to, because it can work with the Ray framework. Ray is important because — I'll search for Ray, I'll just
say ML here — this framework specifically has a product within it called Ray Serve. It's not showing me the graphic here, but Ray Serve allows you to take vLLM and distribute it across compute, so when we saw that video of those Mac M4s being stacked on top of each other, that was probably using something like Ray Serve with vLLM to scale it out. So if you were to run this, you might want to invest in vLLM; the Hugging Face Transformers library is fine as well, but again, we're not going to be able to run the full model on my computer, and not on your computer either.
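For context, serving one of the distills with vLLM looks roughly like this — a minimal sketch assuming a single GPU that can hold the 8B weights; the full 671B model is where you'd bring in Ray Serve and multiple machines:

```python
from vllm import LLM, SamplingParams

# a distilled checkpoint small enough to fit on one consumer GPU
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
params = SamplingParams(temperature=0.6, max_tokens=512)

outputs = llm.generate(["How do I ask 'where is the movie theater?' in Japanese?"], params)
print(outputs[0].outputs[0].text)
```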
So we're going to go back here for a moment, but there's also V3, which has been very popular as well, and that actually is what we were using when we went to the DeepSeek website. If we go over to here and into DeepSeek V3 — yeah, this one's a mixture-of-experts model, and this would be a really interesting one to deploy as well, but it's also a 671-billion-parameter model, so it's
another one that we can't deploy locally. If we could, though, we could have vision tasks and all these other things that maybe it could do. So we're really just going to have to stick with R1, and it's going to be with one of these distills. I'm going to go with the Llama 8 billion parameter one — I don't know why we don't see the other ones there, but 8 billion is something we know we can reliably run, whether it's on the Lunar Lake or on the RTX 4080. So I'm
going to go over here on the right-hand side, and we have Transformers and vLLM; Transformers is probably the easiest way to run it, and we can see that we have some code here. So I'm going to get set up: I'm going to open up VS Code, and I already have a repo — I'm going to put this in my GenAI Essentials course, because I figured if we're going to do it we might as well put it in there — so I'm going to go and open that folder here, and I need to go up a directory.
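The snippet on the model card is along these lines — a minimal sketch of loading the distilled Llama 8B with Transformers and its chat template, not copied verbatim from the card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Teach me one Japanese sentence for JLPT N5."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# R1-style models emit their reasoning before the final answer, so leave room for it
outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```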
to go up a directory I might not even have this cloned so I'm going to just go and grab this directory really quickly here so just CD back and I do not so I'm going to go over to GitHub this repo is completely open so if you want to do the same thing you can do this as well we're going to say gen Essentials okay and um I'm going to go ahead and just co uh copy this and download it here so give it a clone get clone and I'm going to go ahead and open
this up um I'm going to open this with wind Surfer fun because I really like wind surf I've been using that quite a bit if I have it installed here should yeah I do I have a paid version of wind surf so I have full access to it if you don't just you can just copy and paste the code but I'm trying to save myself some time here so we're going to open this up I'm going to go into the Gen Essentials I'm going to make a new folder in here I'm going to call this
one deep seek and I want to go inside of this one and call it um R1 uh Transformers cuz we're going to just use the Transformers library to do this I'm going to select that folder we're going to say yes I'm going to make a new file here and I probably want to make this an iron python file um I'm not sure if I'm set up for that but we'll give it a go so what we'll do is we'll type in basic. [Music] ironpython uh ynb which is for uh jupyter notebooks and you'd have to
already have Jupyter installed. If you don't know how, in my GenAI Essentials course I show you how to set that stuff up, so you can learn it that way if you want. I'm going to go over to WSL here — and yeah, I'll install that extension if it wants to install it — and I'm going to see if I have conda installed. I should have it installed — there it is, and we have a base environment. Any time you're setting up one of these environments you should really make a new one, because that way you'll run into fewer conflicts. So I need to set up a new environment. I can't remember the instructions, but I'm pretty certain I show that somewhere here under local development in this folder, and if I go to conda and into setup, I think I explain it here for Linux — and that's what I'm using right now, Windows Subsystem for Linux 2. Conda is already installed, so I want to create a new environment, and I probably want to use Python 3.10 — in the future you might want to use 3.12, but this version seems to give me the least amount of problems. So I want this command, but I want to change it a little bit: I don't want the environment to be called hello, I want to call it deep seek. We'll go back over here and paste it in, and now we are setting up Python 3.10 and it's going to install some stuff. Okay, now we're good; I need to activate that, so I'll say conda activate deep seek, and now we are using the deep seek environment.
I'm going to go back here on the left-hand side, and what I want to do is get some code set up. If we go back over to the 8 billion distilled model and go to Transformers, we have some code, and if it doesn't work that's totally fine — we'll tweak it from there. I also have example code lying around, so if for whatever reason this doesn't work (sorry, I just paused there for a second), we can grab from my code base, because I don't always remember how to do this stuff — even though I've done a lot of this, I don't remember half the stuff that I do. So we're going to go ahead here, cut this up, and put it up here. But I'm not sure how well Windsurf works with Jupyter and IPython — I've actually never done that before. So it's asking us to start something: you need to select a kernel, and
I'm going to say — oh, it's not seeing the kernels that I want. But you know, one thing I don't think we did is install the IPython kernel, so there's an extra step we're supposed to do to get it to work with Jupyter, and it might be under our Jupyter instructions here — yes, it's this. We need to make sure we install ipykernel, otherwise it might not show up here. So I'm going to go ahead and do conda — whoops — conda with the conda-forge channel. We're saying download from conda-forge, and I think it's conda install, so it's conda install -f conda-forge and then we paste in ipykernel, and now it should install ipykernel. I'm not sure if that worked or not — we'll go up here and take a look: the following packages are not available for installation. Oh, it's -c, not -f — okay, the -c just means use the conda-forge channel — and that should resolve our issue, so we're going to install ipykernel.
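For reference, the environment setup up to this point boils down to a handful of terminal commands — a rough recap, assuming conda is already installed (the environment name and Python version are just the ones used here):

```bash
# create and activate a fresh environment so packages don't conflict with base
conda create -n deepseek python=3.10 -y
conda activate deepseek

# install the Jupyter kernel package from the conda-forge channel (-c, not -f)
conda install -c conda-forge ipykernel -y

# the notebook-side installs that come up later: pip install transformers python-dotenv torch
```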
We'll give it a second and say yes, okay. I'm hoping what that will do is let us actually select the kernel. We might have to close Windsurf and reopen it — we can do the same thing in VS Code, it's the same interface, right? I'm not seeing it showing up here, so I'm just going to close Windsurf. It would have been nice to use Windsurf, but if we can't, that's totally fine. I'm going to go ahead and open this again — I'm going to open up the GenAI Essentials folder, just say open. I'm not using any AI coding assistant here, so we're going to work through it the old-fashioned way. Somewhere in here we have a deep seek folder. I'm going to go ahead and make a new terminal — I want to make sure that I'm in WSL, which I am — and I'm going to say conda activate deep seek, because that's the environment I need. So I now have that activated, and I'm going to go into the deep seek folder, into our R1 Transformers folder. I'm looking
for the deep seek folder — there it is, we'll click into it. I did not save any of the code, which is totally fine; it's not like it's hard to get this code again. So I'm going to go back over, grab this code, paste it in, make a new code block, and put this other part below. Now, normally we'd install PyTorch and some other things, but I'm going to just try the most barebones thing first: it's going to tell me Transformers isn't installed, and that's totally fine. So we'll run that — and it's going to install Jupyter. Oh, it's installing Jupyter, I see — okay, so we do need that; maybe the kernel would have worked. I'm going to go to Python environments, and now we have deep seek, so maybe we could have gotten it to work in Windsurf, but that's fine. So we don't have Transformers installed — there's no module called transformers. I know we've done this before, so we might as well go leverage existing code and see what we did. Here we have a Hugging Face basic example, and yeah, we do a pip install of transformers, so that's what we really need. There's also python-dotenv — we might need that as well, because we might need to put in our Hugging Face API token to download the model; I'm not sure at this point, but I'll go ahead and just install it up at the top. Okay, we'll give that a moment to install; it
shouldn't take too long. We might also need to install PyTorch or TensorFlow or both — that's very common when you're working with open-source models, because they may be in one format or another and need to be converted over; sometimes you don't need to do it at all, but we'll see. Now it's saying to restart, so we'll just do a restart here — we should only have to do that once — and I'm going to go ahead and include the import now, so we have less of an issue here. It's showing us this model, so basically this will download it from Hugging Face: if we grab this model ID here and go back over to the page I had open just a moment ago, it should match this address, right? If I delete this out and put it in here, it's the same address, so that's how it knows what model it's grabbing. We'll go back over here, and it doesn't look like we need our Hugging Face API token, but we'll find
out here in just a moment. So it should download it, we'll get a message here, we'll load Transformers, we'll have the tokenizer, then we'll have the model, and the messages get passed in — it says to copy the local model directory directly, okay. So I think we just have two different snippets here: one uses the pretrained model directly — yes, there are two ways we can do it, I think we cover this — you either load the model directly or you use a pipeline. So let's go ahead and see if we can just use the pipeline, and if I don't remember how to do this we can go over here and take a look — I don't remember everything that I do — but yeah, this is the one we just had open a moment ago, the basic one. It creates a pipeline and then we just use it, so in a sense this should just work. Let's see if it does. I'm going to separate this out so I don't have to continually run the setup part — we'll cut this out, okay, and we'll run that.
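The snippet on the model card boils down to those two options — the high-level pipeline helper, or loading the tokenizer and model directly. Roughly, assuming the 8B Llama distill (this is a sketch, not the exact cell from the video):

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# Option 1: the pipeline helper handles tokenization and generation for you
pipe = pipeline("text-generation", model=model_id)
messages = [{"role": "user", "content": "Who are you?"}]
print(pipe(messages))

# Option 2: load the tokenizer and model directly for more control
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```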
And then we'll run this, okay, and we'll go down below — and it says at least one of TensorFlow or PyTorch should be installed, to install TensorFlow do this. This is what I figured we were going to run into, where it complains like, hey, you need PyTorch or TensorFlow. I don't know which one it needs — I would think it wants TensorFlow, because I saw that mentioned — so I'm going to go ahead and make a new install cell up here. I'm really just guessing: I'm going to say tensorflow, and I'm also going to say pytorch — let's just install both, because it'll need one or the other and one of them will work, assuming I spelled it right. They're two competing frameworks; I learned TensorFlow first, and I kind of regret that because PyTorch is now the most popular, even though I really like TensorFlow, or specifically Keras. We'll give this a moment to install, and once we do that we'll run it again and see what happens. Okay, so it's saying pytorch failed to
build — and I hope that doesn't matter, because if it uses TensorFlow it's fine — but it says failed to build installable wheels. So just a moment here; that was my twin sister calling me, she doesn't know I'm recording right now. I'm going to go ahead and restart this even though we don't have PyTorch — or it might be wrong, it might actually be installed, I'm not sure — and we're going to just try it again anyway, because sometimes this stuff just works, and we'll run it. And it is complaining: it's saying at least one of TensorFlow or PyTorch should be installed — install TensorFlow 2.0, or to install PyTorch read the instructions here. Okay, this shouldn't be such a huge issue, so let's use DeepSeek, since we are big DeepSeek fans here today. I'm going to go over to the DeepSeek website, which is running V3 — it's not even using R1 — log in, give it a moment, and say, you know:
I need to install TensorFlow 2.0 and PyTorch to run a Transformers pipeline model — so we'll give that a go and see what we get. Here it's specifically saying to use 2.0, and it's always a little bit tricky, so I'm going to go back up here and maybe we can pin ==2.0.0 — I mean, it did install TensorFlow 2, we don't need to tell it to do 2.0 again. So we go down below, and let me just carefully look at the error: at least one of TensorFlow 2.0 or PyTorch should be installed, and something about selecting the framework, TensorFlow or PyTorch, to use with the model — oh, so it's asking which framework to use, because it doesn't know. Okay, so I'm going to go back over here and just give it this error and see if it can figure it out — and it's not exactly what I want, so I'm going to stop it and say: I am using a Transformers pipeline, how do I specify the framework? Okay, I'm surprised I have to specify the framework — usually it just picks it up. And so here we have PyTorch or TensorFlow — it thinks TensorFlow successfully installed — I'm not sure if it's just guessing, because this thing could be hallucinating, we don't know, but we'll go ahead and give this a try and run it here. And it's saying we're still getting that error, right? So I'm going to look this up — this is probably a common Hugging Face issue — and somebody has commented: you need
to have PyTorch installed, mhm. So let's ask DeepSeek — actually, I don't know if anyone has told us how to do this yet; give me a second, let me see if I can figure it out. All right, I went over and we're asking Claude instead — maybe Claude, because it's not just the model itself, it's the reasoning behind it, and V3 didn't really get us very far even though it's supposed to be a really good model. Here it's suggesting that PyTorch is generally what's used, and that maybe my install line is incorrect: we have tensorflow, which is fine, but it's suggesting we install torch and accelerate. Okay, I'm going to go ahead and run that — so maybe the pip package for PyTorch is just called torch and I forgot; I don't know why I wrote pytorch. We'll give that a moment and see what happens. The other thing it's saying is that we probably don't need to specify the framework, because for Llama models in particular it normally uses PyTorch — I'm not sure if that's the case here. Another thing we could do is go take a look at Hugging Face and look at the files — I'm seeing what look like TensorFlow files, so it makes me think it is using TensorFlow, but maybe it needs to be converted over to PyTorch, I don't know. We should have both installed anyway, so even though I removed it from the top there, TensorFlow is still installed, and we could just leave it there as a
separate line with, say, pip install tensorflow. This is half the battle to get these things to work — dealing with these conflicts — and you will get something completely different than me and have to work through it. But we'll wait for this. It would be interesting to see if we could serve this via vLLM, but we'll first work this way. Okay, all right, that's now installed; I'm going to go to the top here and give it a restart, and now we should have those frameworks installed.
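For what it's worth, once torch is actually installed you can sanity-check it from the notebook, and the pipeline helper does take a framework argument if it ever guesses wrong — a small sketch, reusing the same model ID (not the exact cell from the video):

```python
import torch
from transformers import pipeline

# confirm the PyTorch backend is present and whether CUDA is visible
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# framework="pt" forces the PyTorch backend; "tf" would pick TensorFlow
pipe = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    framework="pt",
)
```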
We'll go ahead and run the Transformers pipeline cell next — and now it's working, so that's really good. Is it utilizing my GPUs? I would think so; sometimes there are configurations you have to set, but I didn't set anything here. I think right now it's just downloading the model, so we're going to wait for the model to download and then we just want to see if it infers. I'm not sure why it's not getting there, but maybe it'll take a moment to get going. We
didn't provide it any Hugging Face API key, so maybe that's the issue — it's kind of hanging here, which makes me really think I need my Hugging Face API key. So what I'm going to do is grab this code over here, because I just assume that it wants it — that's probably what it is. Sorry, I'm going to just pull this up — oops — we'll paste this in as such, I'm going to drag it up here, and I'm going to make a new env text file. I'm also going to gitignore that, because I don't want it to end up in the repo. The variable is something like a Hugging Face API key — I never remember what it is, so we'll go take a look; I'm just doing this off screen. So, Hugging Face API key environment variable — okay, key, where are you, key? I'm having a hard time finding the name of the environment variable right now — oh, it's HF_TOKEN, that's what it is. So I need the HF token, and I'm going to go back and see if it's actually downloaded at all. Did it move at all? No, it hasn't, so I don't think it's going to move, and I think it's because it needs the Hugging Face API key. So I'm over here in Hugging Face and I have an account: you go down below, you go to Access Tokens — I've got to log in, one sec. All right, I'm going to create a new token, it's going to be read-only, and this will be for deep seek. There were no terms I had to accept to be able to download the model, so I think it's going to work. I'm going to get rid of my key later on, so I don't care if you see it. I'm in this file here — the variable was called HF_TOKEN, I believe — so now we have our token supposedly set. We'll go back over, scroll up, and run this, and now it should know about my token — I shouldn't even have to set it explicitly, I don't think — so maybe it'll
download now, I'm not sure. I'll go back over to this one — notice we're not passing the token in anywhere. I'm just going to bring this down by one; this is acting a little bit funny today, I'm not sure why it's going all the way down there — it's probably just the way the messaging works here. I'm going to cut this and paste it down below, so I'm really just trying to get this to trigger — I mean this one, this other one here — but it's not doing anything. Another way we could do it is to just download the model directly; I don't like doing it that way, but we could. I'm just looking at the Hugging Face token environment variables — yeah, it's HF_TOKEN, so I have it right, but why it's not downloading I don't know. Let's go take a look at that model page and just make sure there wasn't anything we had to accept — sometimes that's a requirement, where if you don't accept the terms they won't give you access to it.
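If the token really were the problem, one way to wire it up explicitly looks roughly like this — a sketch, assuming python-dotenv is installed and that the key lives in a local env file (the file name here is just a guess at the one created in the video) under the HF_TOKEN variable:

```python
import os
from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv("env.txt")         # hypothetical name for the env file created above
token = os.getenv("HF_TOKEN")  # the variable name the Hugging Face libraries look for
if token:
    login(token=token)         # authenticates downloads for gated or private models
```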
So if I go over to the model card, it doesn't show anything that I have to accept to download this — yeah, there's nothing here whatsoever. So again, just carefully looking here, we have some safetensors files, that's fine — oh, here it goes. Okay, so we just had to be a little bit patient; it's probably a really popular model right now, and that's probably why it's so slow to download. I'm just going to wait till this is done downloading — I'll be back in just a moment — it's downloading and running the pipeline. Okay, I did put the print down below, so it might execute here or it might execute up there, we'll find out in a moment — this one might be redundant because I took it out while it was running live — but we'll wait for this to finish. Okay, it's taking a significant time to download — oh, maybe it's almost done — but yeah, it's downloading the shards, getting the checkpoints, and now it's starting to run, saying cuda:0. I think that means it's going to utilize my GPUs — I'm pretty sure device 0
means the first GPU (the CPU would show up as device -1 in the pipeline API), so now it appears to be running. Okay, we'll just wait a little bit longer. Now, the thing is, once this model is downloaded we can just call pipe every time and it'll be a lot faster, right? We'll wait a little bit longer, okay. All right, I'm back, and it ran the first part of the pipeline, which is fine, but I guess I didn't run this line here, so we'll run it — and since we separated things out, I think this one's defined, hopefully it is — and we'll run this and it should work. It's probably now just doing its thing trying to run, but we'll give it a moment and see what happens. Okay, yeah, I don't think it should take this long; I'm going to stop this and run it again, and I think it'll be faster this time. The video I'm recording here is kind of struggling — that's why I like to use an external machine —
because now my computer is hanging, so what I might need to do here is pause if I can. All right, I'm kind of back — my computer almost crashed again. I'm telling you, it's not the Lunar Lake; it's that these things can exhaust all your resources, and that's why it's really good to have an external computer that's specifically dedicated, like an AI PC or even a dedicated PC with GPUs, not your main machine. But there is a tool here called nvidia-smi, and it will actually show us the usage. It's probably not going to tell us much now because it's already running, but as this is running we can use it to figure out how much the GPUs are being used. I'm going to go back up here for a moment and take a look — it says CUDA went out of memory, and that CUDA kernel errors may be asynchronously reported at some other API call. So this is what I mean about this being a little bit challenging.
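nvidia-smi is the command-line way to watch this; from inside the notebook you can get a rough read on the same numbers through torch's CUDA helpers — a small sketch, nothing DeepSeek-specific:

```python
import torch

if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info()  # free/total bytes on the current GPU
    print(f"GPU memory: {free_b / 1e9:.1f} GB free of {total_b / 1e9:.1f} GB")
    print(f"Allocated by this process: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
else:
    print("No CUDA device visible")
```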
And again, think about the other models we downloaded earlier — and by the way, I'll bring my head back in here so we stop seeing the EOS Webcam Utility overlay — the thing we saw was that when we used Ollama to download it, it was using GGUF, which is a format optimized to run on CPUs (and it can utilize GPUs as well), so it was already optimized, whereas the model we're downloading here is not optimized, I don't think, and apparently I just don't have enough memory to run it at the 8 billion parameter size. But the question is, is it downloading the correct one? If we go back over here — this one is the distilled 8 billion parameter model, it has to be, because of that there — so we might actually not even be able to run this, at least not in that format. Okay, so you can see where the challenges come in. We go over to our files and take a look: we have a bunch of safetensors, and that's not going to really help us that much.
We've got to go back into DeepSeek here and look at the other distills they have. Well, here's the question — yeah, we did the 8 billion parameter one, so if we go in here, there is a Qwen 7 billion, which is a bit smaller, and there's also the 1.5 billion one — that's not going to be as useful for us, but you know what, I'm kind of exhausting my resources here, so we can run this as an example, and if you have more resources, like more RAM, then you'll have less of a problem. So I'm going to go ahead and copy this over and paste it in as such. Okay, so now we are literally just using a smaller model, because I don't think I have enough memory to run the 8B one, especially while I'm recording at the same time. And if we go over here — I'm just typing clear — we have fan, temperature, performance, and you can see none of the GPUs are being used right now.
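Swapping down to a smaller distill is a one-line change in the pipeline call — for example, assuming the Qwen 7B variant (the 1.5B one follows the same pattern):

```python
from transformers import pipeline

# the smaller distills need far less GPU memory than the 8B Llama distill
pipe = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    # model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # smaller still
)
```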
If the GPUs were being used, they would be showing up over here, right? Right now I think it's just attempting to download the model, because we swapped the model out, so at some point it should say, hey, we're downloading the model. It's not, for some reason, but we'll give it a moment, because the other one took a bit of time to get going, so I'm going to pause until I see something. All right, after waiting a while this one ran, and it says CUDA out of memory — CUDA kernel errors might be asynchronously reported at some other API call, and there's a stack trace — so it keeps running out of memory, and I think that's more of an issue with this computer. I might have to restart and run this again, so I'm going to stop the video and restart — it's the easiest way to dump memory, because I don't know any other way to do it — but if I go here it shows no memory usage, so I'm not really sure what the issue is.
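For what it's worth, there is a gentler way to release GPU memory than rebooting — a sketch, assuming the pipeline object is what's holding the weights:

```python
import gc
import torch

del pipe                  # drop the reference to the pipeline (and the model it wraps)
gc.collect()              # let Python actually reclaim the objects
torch.cuda.empty_cache()  # hand the cached GPU memory back to the driver
```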
Anyway, I'm going to restart, I'm also going to close OBS, run it offline, and then show you the results. Okay, be back in just a moment. All right, I'm back, and I just went ahead and ran it, and this time it worked much faster — maybe it was holding on to the cache of the old one, but giving my computer a nice restart really did help, and you can see that we are getting the model to run. I don't need to recreate the pipeline every single time — I'm not sure why I ran that twice — but I should be able to run this again. Again, I'm recording, so maybe this won't work as well while it's utilizing the GPUs; we'll see. So now it's struggling, but offline I ran this and it was almost instantaneous — that's how fast it ran — so yeah, I think it might be fighting for resources, and that is a little bit tricky for me here. We'll go back over here
to nvidia-smi. I'm not seeing any of the processes being listed, so it's kind of hard to tell what's going on, but I'm going to go ahead and just stop this — can I stop this? — but it clearly works. So even though I can't show you — yeah, see, over here it says volatile GPU utilization 100%, and then down here it says 33%. I thought these cores would start spinning up so we could make sense of it, and then here, I guess, is the memory usage: over here you can see we have 790 of 8,818, and here we can see kind of the limits of it. But if I run it again, you can see that just me recording this video is using up the memory, so that makes it a bit of a challenge, and the only way around that would be to use onboard graphics, which are not working for me, because I don't know if I even have any onboard graphics. But that's okay. So anyway, that's our example here that
we got working — it clearly does work. I would like to do another video where we use vLLM, but I'm not sure if that's possible; we'll consider this part done, and if there's a video after this, then you know I was able to get vLLM to work. See you in the next one. All right, that's my crash course on DeepSeek. I want to give you some of my thoughts about how I think the crash course went and what we learned as we worked through it. One thing I realized is that in order to run these models, you really do need optimized models. When we were using Ollama, if you remember, it had the GGUF format — a file format that is more optimized to run on CPUs; I know that from the LlamaIndex exploration I did for my GenAI Essentials course. So optimized models are going to make these things a lot more accessible. When we were using NotebookLM — or whatever it was called — actually, it wasn't NotebookLM, it was LM
Studio — NotebookLM is a Google product — but LM Studio was adding that extra thought process, so more things were happening there and it was exhausting the machine, even on my main machine where I have an RTX 4080, which is really good. You could see that it ran well, but then when we were trying to work with the model directly, where we weren't downloading an optimized version, my computer was restarting. So it was exhausting both my machines trying to run it — though I think on this machine, because I was using OBS, that was eating a lot of my resources. There's also a video I did not add to this where I was trying to run it on vLLM, and I was even trying to use the 1.5 billion Qwen distilled model, and it was still saying I was running out of memory. So you can see this stuff is really, really tricky, and even with an RTX 4080 and with my Lunar Lake there were challenges. But there are areas where we can utilize it — I don't think we're exactly there yet to
have a full AI-powered assistant with thought and reasoning, but the RTX 4080 kind of handled it, if that's all you're using it for and you're restarting those conversations and tuning some of those settings down, and the Lunar Lake could do it if we tuned it down. One thing I said earlier that I want to correct after doing a bit more research — because I forget all the stuff that I learned — is that NPUs are not really designed to run LLMs. I was saying maybe there's a way to optimize for them or something, but NPUs are designed to run smaller models alongside your LLMs, so you can distribute a more complex AI workload — maybe you have an LLM and next to it a smaller model that does something like images, I don't know — and maybe you can utilize the NPU for that. But we're not going to see anything, at least in the next couple of years, utilizing NPUs to run LLMs — it's really the GPUs. So we are really fixed on what the iGPU on the Lunar Lake and what the RTX 4080 can do. So, you know, maybe if I had another graphics card — and I actually do, I have a 3060, but unfortunately the computer I bought doesn't let me slot it in — if there was a way I could distribute the compute across this computer and my old computer, or even the Lunar Lake as well, then I bet I could run something a little bit better. But you probably want, like, a home-built
computer with two graphics cards in it, or multiple AI PCs stacked together with distributed compute. And just as we saw in that video where the person was running the 671 billion parameter model — if you paid close attention to the post, it actually said in there that it was running with 4-bit quantization, so that wasn't the model running at its full precision, it was running highly quantized. Quantization can be good, but 4-bit is really aggressive.
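For context, 4-bit loading in the Transformers world typically goes through bitsandbytes — a minimal sketch using the 8B distill as a stand-in (the 671B model needs a very different multi-GPU setup), and assuming bitsandbytes and accelerate are installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# 4-bit quantization roughly quarters the memory needed for the weights
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across whatever GPU/CPU memory is available
)
```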
And even then it was chugging along, so the question really is: even if you had seven or eight of those machines, you'd still have to quantize it, which is not easy, it's still slow, and would the results be any good? As an example it was cool, but I think that 671 billion parameter model is really far out of reach. That means we can try to target one of the other ones, like the 70 billion parameter model, or maybe we just want to reliably run the 7 billion parameter model by having one extra computer — and if you're smart about it, you're looking at maybe $1,000 to $1,500, and then you can run a model. It's not going to be as good as ChatGPT or Claude, but it definitely paves the way there. We'll just have to keep waiting for these models to be optimized and for the hardware to improve or the cost to come down — but maybe we're just two computers, or two graphics cards, away. But yeah, that's my two cents, and I'll see you in the next one. Okay, ciao.