All You Need To Know About Running LLMs Locally

150.27k views · 2,114 words
bycloud
RTX4080 SUPER giveaway! Sign-up for NVIDIA's GTC2024: https://nvda.ws/48s4tmc Giveaway participation...
Video Transcript:
Last year we thought 2024 was going to be job-market hell, but it turns out that not only do we have even more hiring freezes, we also found ourselves even deeper in the subscription nightmare that is AI services. For just 20 bucks a month you get your own AI assistant that is good at coding but bumbles at the same time, and can write you some good emails that could have been only ten words. Some may say that's money well spent, but some may say you're not maxing enough, because if I could run free chatbots nearly equivalent to ChatGPT myself, why would I pay 20 bucks a month for a service just to be told I can only use it again at 8 p.m.? So if you ever wanted to know how to run AI chatbots and LLMs locally, this video is the perfect gateway to get you started.

Picking the right user interface for yourself is very important, as it'll be catering to your needs depending on how deep you are in the rabbit hole. First we have text-generation-webui, which usually goes by the name oobabooga. It offers three modes: default, which is basic input/output; chat, which is a dialogue format; and notebook, which is pretty much raw text completion. It's currently the most popular UI and offers most of the basic stuff you need. Second we have SillyTavern, which focuses more on the front-end experience of using an AI chatbot, like chatting, roleplaying and even more visual-novel-like presentation. It's just generally a really nice-looking front end for using LLMs, and since it's only a front end, it needs a back end like oobabooga to run the AI models. But if you just want a straightforward exe file, there is LM Studio, which has lots of great native functions like a Hugging Face model browser that makes finding models much easier. It also has some much nicer quality-of-life info that tells you whether you can run a model or not. It's definitely a good alternative if you don't like the Gradio type of interface; you can hop between models swiftly, and you can use it as an API for other apps too. Lastly we have Axolotl, a command-line interface that offers the best support for fine-tuning AI models, so if you do get deep into fine-tuning, this is definitely your first choice rather than doing it in oobabooga.

For this video I'll be using oobabooga, as it provides the most well-rounded functionality, with support for most operating systems and for NVIDIA, AMD and Apple M-series hardware. You can also add SillyTavern as the front end if you want. Follow these installation steps and you're good to go.
Next, you can start browsing all the free and open-source models on Hugging Face and download what you want to use with oobabooga's built-in downloader, by copying and pasting the last two URL slugs. If you don't have any models in mind, here's a list of models I recommend along with their best use cases. Beside the model name there is usually a version and a number followed by a "B", which is how many billion parameters the model has; it often acts as an indicator of whether you can run it on your GPU or not.
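If you'd rather script the download than use the built-in downloader, here's a minimal sketch using the huggingface_hub library; the repo name and target folder are just examples, not something prescribed in the video:

from huggingface_hub import snapshot_download

# The repo_id is exactly the "last two URL slugs" of the model page,
# e.g. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",          # example model
    local_dir="text-generation-webui/models/Mistral-7B-Instruct-v0.2",
)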
But if the name starts doing maths, or it says MoE, that means it's a mixture-of-experts model, which I explained in my previous video; you can check it out. For a better approximation of how much VRAM you need for each model, you can also refer to these Hugging Face Spaces. Sometimes you'll see other funny-looking letters in the names too, like GGUF, AWQ, safetensors, EXL2 and GPTQ, but they're all kind of meaningless until you start running your first model and encounter the runtime error "CUDA out of memory". Well, safetensors is technically unrelated to this, since it's just a secure file format that prevents people from adding sus things to your PC when you load a model from an unknown source. The others are different model formats that potentially let you run models you typically couldn't because of their large parameter counts. These formats achieve this by shrinking down the precision, or quantizing the weights, which is essentially reducing the number of bits used to represent each number the model uses. This usually lobotomizes the models somewhat, but it's not as bad nowadays, since the loss isn't as noticeable when it's applied to the better models.
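As a toy illustration of what "reducing the number of bits" means (a generic example, not how any of the formats below actually do it), here's round-trip 4-bit quantization of a few weights:

import numpy as np

# A handful of fp32 "weights"
weights = np.array([0.12, -0.53, 0.98, -0.07, 0.31, -0.88], dtype=np.float32)

bits = 4
qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit integers
scale = np.abs(weights).max() / qmax       # one scale shared by the whole block

quantized = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
restored = quantized.astype(np.float32) * scale

print("original:", weights)
print("restored:", restored)               # close, but not identical -> the "lobotomy"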
For the others: GGUF, the successor of GGML, is a binary file format for models that supports different quantization schemes, can run on CPU, and is contained in a single file, which most models are not, by the way. EXL2 is the format used by ExLlamaV2, which mixes quantization levels within a model to achieve an average bit rate between 2 and 8 bits per weight; it's the fastest option but only available on NVIDIA GPUs. AWQ is a quantization method that sets the smallest weights to zero and rounds the rest to the nearest quantization threshold to reduce the model size. GPTQ is a layer-wise quantization algorithm that aims to minimize the error at the output, so it doesn't blindly lobotomize itself. However, GPTQ and bitsandbytes, which I haven't mentioned yet, aren't as practical for inference anymore, since AWQ, EXL2 and GGUF are just generally better.
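To get a feel for why bits per weight matter, here's some back-of-envelope arithmetic; the numbers are illustrative only, since real files also carry metadata and keep some tensors at higher precision:

params = 7e9  # a "7B" model

# file size ≈ parameter count × bits per weight ÷ 8
for label, bpw in [("fp16", 16), ("8-bit", 8), ("~5 bpw (mid GGUF quant)", 5), ("~4 bpw (EXL2 / 4-bit)", 4)]:
    print(f"{label:28s} ≈ {params * bpw / 8 / 1e9:.1f} GB")

# fp16 ≈ 14 GB, 8-bit ≈ 7 GB, ~5 bpw ≈ 4.4 GB, ~4 bpw ≈ 3.5 GB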
Apart from the safetensors format, these quantization methods all make models smaller after training, which still only potentially lets you run them, because we also need to take context length into account: it eats up your memory too, and not being able to provide a decent amount of context makes the model kind of useless, since the model often needs context to answer your questions. For clarification, context is everything you feed the model: instructions, input prompts and conversation history. The longer the context length a model supports, the more information the AI can use to process your prompt, like summarizing a paper or keeping track of what it generated earlier in a conversation. AI models don't save what you say to them, so you need to provide that history or data yourself. For the precise technical details, 8,000 tokens of context usually costs around 4.5 GB of VRAM, but some models like Mixtral and DeepSeek get the VRAM usage even smaller because they have GQA implemented, which brings it down to around 1.5 GB for 8,000 tokens.
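Those numbers come from the KV cache the model keeps for every token in context. Here's a rough estimate you can do yourself; the layer/head/dimension values below are the published configs for Llama-2-7B and Mistral-7B (which uses GQA), and the result is only a ballpark:

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x because both a key and a value vector are stored per token, per layer
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(kv_cache_gb(32, 32, 128, 8000))   # Llama-2-7B, no GQA   -> ~4.2 GB
print(kv_cache_gb(32, 8, 128, 8000))    # Mistral-7B, with GQA -> ~1.0 GB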
So does that mean it's even more doomed for typical 12 GB VRAM consumers to run local LLMs? Well, not quite. There's something called CPU offloading, which lets you offload part of the model onto the CPU and system RAM. It's a feature from llama.cpp, the same people that made GGUF, so the model has to be in GGUF format to do this. But even a 12 GB VRAM card can run Mixtral 8x7B, which is around 45 billion parameters in total, by putting 10 GB of the model in VRAM, keeping 2 GB for context, and letting the CPU and system RAM handle the rest. The trade-off is of course speed, but using a quantized model you'd still get some pretty top-tier results for running locally, for free.
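Here's what that offloading looks like with llama-cpp-python, the Python bindings for llama.cpp; the file name and layer split are placeholders you'd tune to your own GPU:

from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # any local GGUF file
    n_gpu_layers=20,   # layers kept in VRAM; everything else runs on CPU + system RAM
    n_ctx=8192,        # context window to reserve memory for
)

out = llm("Q: What does CPU offloading trade away?\nA:", max_tokens=64)
print(out["choices"][0]["text"])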
And if you do want to run it faster, other than EXL2 there are other hardware-acceleration frameworks, like the vLLM inference engine, which is great on the service side: it handles parallel requests very well and gets up to 24 times higher throughput than Hugging Face Transformers. NVIDIA is also offering TensorRT-LLM, which you can use on some popular models to speed up inference by around four times on your RTX GPU.
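For reference, a minimal vLLM batch-inference sketch looks roughly like this; the model name is just an example, and the speed-up you actually see depends entirely on your hardware and batch sizes:

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")      # example model
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM shines when you throw many prompts at it at once (parallel requests)
outputs = llm.generate(["Explain CPU offloading in one sentence.",
                        "Name three quantization formats."], params)
for o in outputs:
    print(o.outputs[0].text)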
They also recently released an app called Chat with RTX, a local UI that can connect a model to your local documents and other data, so you can ask it to scan your documents without uploading them anywhere, which is super good for privacy if that's what you're looking for. It can also watch a YouTube video and answer questions about its content, so if you're too lazy to watch my videos or just need to revisit a few key points I've made, Chat with RTX can answer that easily for you.

But if speed is not what you're looking for and you want the AI to do something more specific, like becoming a chatbot that teaches you how to code, or tech support for grandparents, this is where fine-tuning comes in. QLoRA is currently the best way to fine-tune a model, as it doesn't need to train the whole billion-parameter model, only a fraction of it. Besides, no one really fine-tunes the entire model, as it's pretty inefficient and you'd probably need to spend a quarter of a million dollars just to train it for a month. But before you start fine-tuning, remember the golden rule in AI: garbage in, garbage out. If the training data is badly organized, the result will be trash. The model you choose to fine-tune has to be a good fit for what you want it to do, too.
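For the curious, a bare-bones QLoRA setup with transformers, bitsandbytes and peft looks roughly like the sketch below; the base model and hyperparameters are placeholders, not a recommended recipe:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit...
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # example base model
    quantization_config=bnb_config,
    device_map="auto",
)

# ...then train only the small LoRA adapter matrices on top of it
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()          # typically well under 1% of the weights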
The dataset for fine-tuning usually needs to follow the format of the original dataset that was used to train the model you're fine-tuning on. There are currently a lot of different formats, and because data formatting is how instruction, Q&A and dialogue models get made, it's necessary to pick one correctly to reproduce something in a similar format (there's an example record below). There are other fine-tuning techniques that are less capable than QLoRA, but they mostly serve a different purpose too, like getting the model to generate answers that satisfy some moral guardrails, or responses that are more preferred by humans.
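As an example of what "following the format" means, here's one record in the Alpaca-style instruction format; other models expect ChatML, ShareGPT and so on, so check the model card first:

example_record = {
    "instruction": "Explain what CPU offloading does when running a local LLM.",
    "input": "",
    "output": "It keeps some layers in VRAM and runs the rest on the CPU and "
              "system RAM, trading speed for fitting a bigger model.",
}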
And yeah, that's roughly everything about running local LLMs and AI chatbots. There are also other extensions you can use, like RAG, to hook an LLM up to a database through LlamaIndex, so you can do things like asking a model about some local files, just like it was demoed in Chat with RTX, but at an even bigger scale. Or you can replace GitHub Copilot with Continue.dev using your local models, which would save you some more bucks.
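A minimal LlamaIndex sketch of that RAG setup could look like this; the imports follow recent llama-index releases, and by default it calls out to OpenAI, so you'd point its Settings at a local model and embedder to stay fully offline:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Index a folder of local files, then ask questions against it
documents = SimpleDirectoryReader("./my_local_docs").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("Summarize the main points of these documents."))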
Running local LLMs may just be the peak of money-saving in this age of AI without giving up performance, so maybe this is where you start your money-maxing journey in AI and run local models that will save you an extra 20 bucks during this hiring freeze. But if you're looking for more GPUs to run local models faster, NVIDIA actually sent me an RTX 4080 SUPER to give away to you guys, and look at this bad boy. I'll probably not open this box, so whoever gets it can still enjoy the thrill of fully unboxing it. To join the giveaway you just need to attend at least one virtual GTC session and show proof of attending, which is really straightforward. I'll link the participation form down in the description so you can join once GTC starts. For the proof, I just need you to take a selfie while watching any of the live virtual GTC sessions, and if you don't want to show your face, you can just show a thumbs up with GTC on your monitor, or do something unique that doesn't look like it's generated by AI. Also, very important: you have to use my link down in the description to sign up for the free virtual GTC sessions. This time they also have the original authors of the Transformer, from the paper "Attention Is All You Need", hosting a panel on March 20th, which is going to be one of the highlights; I highly recommend attending so you can watch it live online, so definitely add it to your schedule and join the giveaway at the same time. Shout out to Andrew, lelz, Chris, Leo, Alex Jay, Alex, Marice, Mig Lim, Dean, fif fall and many others who support me through Patreon or YouTube. Follow my Twitter if you haven't, and I'll see you all in the next one.