Feed Your OWN Documents to a Local Large Language Model!

Dave's Garage
Dave explains how retraining, RAG (retrieval augmented generation) and context documents serve to ex...
Video Transcript:
Hey, I'm Dave, welcome to my shop. Today's episode is one of the most requested I've ever done: how to add your own knowledge files and documents to a large language model, both local and online. I'll explain the difference between retraining a model, using retrieval augmented generation (RAG), and providing documents for the context window, and I'll show you how to upload your own documents both to ChatGPT and to a local Ollama model running under Open WebUI. Once you insert your documents into the model, they become part of its knowledge base for answering your questions.
Before we dive into retraining, RAG, and context documents, however, I want to give you a little demo of a modestly sized model running locally on the dual Nvidia RTX 6000 Ada setup. That's because in my last episode I saddled the big workstation with a massive model that the other machines couldn't even hope to run: 405 billion parameters loaded into its 512 GB of RAM. But the protests in the comments were many; folks wanted to see this beast of a machine chewing on a regularly sized model to see how it would perform. So let's take a quick look at just how fast it can run a 1 billion parameter model.

To run the smaller model, I'll pull up the listing, and we'll see that we do in fact have Llama 3.2, the 1 billion parameter model. I'll run that with the verbose flag, and as soon as the model comes up and is ready, we'll ask it to tell us a story and see how many tokens per second it can actually generate. Well, it's cranking through pretty quick here: 345 tokens per second. Let's ask for a longer story to see if it can sustain that. It scrolls by very quickly, and when it's done, how many tokens per second do we get? 324. So it's able to easily sustain over 300 tokens per second generating on the dual RTX 6000 Ada machine. I'd mention that it's a Threadripper, but it's using almost no CPU during that generation.

For comparison, we'll launch the 70 billion parameter model, which is predictably about 70 times as big as the 1 billion parameter model, ask it to tell us a story, and see how many tokens per second it can generate. I'll let it run for a little bit here so you can see the natural output speed, which is still very usable; if you aren't generating tons of content, I think it's actually quite fast, but it's certainly slower than the 1 billion. Let's fast forward to the end and see just how many tokens per second this one is generating. At about 20 tokens per second it's quite a ways off the 300 pace, but it's still more than usable and fast enough for most purposes.
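By the way, if you'd rather script that speed test than read the summary that `ollama run --verbose` prints, here's a minimal sketch using the ollama Python package. It assumes the llama3.2:1b model is already pulled locally; the prompt is just an example.

```python
# A small sketch of the same speed test from Python using the ollama
# package (pip install ollama), assuming llama3.2:1b is already pulled.
# Ollama's generate response reports eval_count (tokens produced) and
# eval_duration (nanoseconds), which is what the verbose flag summarizes.
import ollama

resp = ollama.generate(model="llama3.2:1b", prompt="Tell me a story.")
tokens_per_sec = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tokens_per_sec:.0f} tokens/second")
```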
Now let's get to our main topic of adding information to existing large language models. There are three primary ways you can do it: retraining a model, using retrieval augmented generation (RAG), and uploading documents directly into the context window.

First, let's consider retraining a model. Think of the model as being a bit like a student who has already learned a lot, but now you want to teach them something new or correct their understanding. You go back to the basics with them, bringing in new books and updated lessons and putting them through another round of study sessions. They don't forget much of what they already know, but they use your training to fine-tune and add to their knowledge. The process is thorough, but it's a lot of computational work and takes a lot of time, and once they've learned it, it's permanent: every time you use the model from then on, it will have that updated knowledge embedded deep within it, ready to apply in all relevant situations. But retraining a model requires a lot of resources: more data, lots of computing power, and time. It's like sending your student back to school for a while to learn and improve on what they already know.

Now let's compare that to retrieval augmented generation, or RAG. Here, instead of retraining the model, it's as if your student doesn't have all the information they need at their fingertips but knows exactly where to look. When asked a question, they quickly consult a library of books, pull out the most relevant sources, and give you a response that combines what they knew before with what they've just looked up. This process is faster than retraining because it doesn't involve the deep, permanent learning of new material; instead, it allows the model to retrieve information dynamically from a database or document pool, crafting its answer from up-to-date sources. It's a much more agile process, and it's great when you need the model to adapt to ever-changing or large sets of data without retraining it every time.

And finally, there's uploading documents to the context window. Imagine you're in a one-on-one conversation with the model and you hand it some notes. The model can reference those notes while talking to you, but it won't internalize them the way it would with actual retraining. It's like a student who gets to peek at a cheat sheet during an exam: they can look at your document and use it to answer your questions, but once the exam is over, or your session ends, they'll forget the information. This method is the quickest way to provide specific, immediate knowledge, but the information only lasts as long as that specific session or conversation; when you're done, the model won't retain the uploaded document unless you upload it again next time.

So, in summary: retraining builds long-term, permanent knowledge within the model; RAG fetches relevant knowledge dynamically without needing to retrain; and uploading documents into the context window is like giving the model a temporary cheat sheet for quick reference. Each has its own strengths, depending on how permanent or flexible you need the model's knowledge to be.

Before we get into these approaches, let's look at why we won't be doing the first method, a formal retraining of a model. The first reason is openness: you have no access to the ChatGPT model itself, so there's no way you can directly modify or retrain it.
With something more open like Llama 3.2, the models are generally designed to be open in the sense that you can access the model weights and modify them, but certain conditions still apply. You need access to the model weights, the core data that defines the model's knowledge, and depending on licensing and availability, you may or may not be allowed to use those weights freely for commercial purposes or large-scale projects. But let's say none of those are serious roadblocks. The next problem is hardware and software: retraining a model takes almost as much in terms of skill and resources as training that model did in the first place. Fine-tuning an LLM like Llama 3.1, even on a smaller dataset, still requires some serious hardware. That means you're looking at leasing or buying data center GPUs like the NVIDIA A100, and depending on the model size, it can require an enormous amount of RAM. The other hurdle is that it requires significant programming: even if it's mostly just Python, you wind up needing to write code using PyTorch or TensorFlow or something similar to run your fine-tuning tasks, and combined with all the hardware required, this likely rules out full retraining for most people.
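To make that concrete, here's a hedged sketch of what even a lightweight fine-tune looks like in Python, using Hugging Face transformers with a LoRA adapter from the peft library. The model name and training file are placeholders (Llama weights are gated behind a license acceptance), and a full-weight retrain would demand far more hardware than even this does.

```python
# A hedged sketch of a lightweight fine-tune with Hugging Face transformers
# and a LoRA adapter via peft. The model name and training file are
# placeholders, not a recipe from the video.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-3.2-1B"  # hypothetical choice; requires access
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA trains small adapter matrices instead of all the weights, which is
# the only realistic path without data center GPUs like the A100.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

ds = load_dataset("text", data_files="pdp11_notes.txt")["train"]
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512), batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```

Even this adapter-only approach wants a serious GPU for anything beyond toy models, which is exactly why we skip retraining here.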
So we'll focus on RAG and context, both of which are fully doable without custom coding.

Let's start with the easiest of these mechanisms: uploading additional documents into the model's context window. Our documents become the cheat sheets we referred to earlier, allowing the model to reference them and incorporate them into its answers. With ChatGPT, you've likely noticed, and even used, the upload button at some point. When you upload a file to ChatGPT, it becomes part of its current context window.

Okay, let's ask about something it may have knowledge of from its general knowledge base, but where I want the specific order in which to push the buttons on the 11/34 to boot it up. Let's see if it knows. So that's sort of a general approach for starting a PDP-11; it does not take into account the owner's manual of the 11/34, which tells you how to actually use the boot switch to boot to a ROM location that will bootstrap the machine. It doesn't seem to have that context, so in our next step, let's give it that context.

This is the simplest of cases: beyond just dragging a document into the actual browser window, we'll click on the paperclip to attach a document, say upload from the computer, and pick the PDP-11/34 manual. Now that the document is uploaded, we can ask it questions about it: read the attached document and tell me the order to push the buttons to start the system. And now, thanks in part to the technical document we uploaded at the beginning of the session, it has the information it needs to give us the correct answer: use the BOOT/INIT switch, which goes through the M9312 bootstrap and boots the system from ROM.

You could do this each and every time you wanted ChatGPT to have access to this additional knowledge, but it gets a little cumbersome, especially if you have multiple documents to upload each time.
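Incidentally, the local equivalent of this context stuffing is trivial to script. Here's a minimal sketch using the ollama Python package, assuming you've already extracted the manual's text to a file; the file name and question are placeholders.

```python
# A minimal sketch of context stuffing with a local model via the ollama
# Python package. The file name is a placeholder for your own extracted
# manual text (the PDFs would need OCR/text extraction first).
import ollama

manual = open("pdp1134_manual.txt").read()

reply = ollama.chat(
    model="llama3.2",
    messages=[
        # The document rides along in the prompt itself; nothing persists
        # after this call, just like a session upload in ChatGPT.
        {"role": "system", "content": "Answer using this manual:\n\n" + manual},
        {"role": "user", "content": "In what order do I push the buttons to boot the 11/34?"},
    ],
)
print(reply["message"]["content"])
```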
Let's look at a better approach: creating our own custom GPT with our context documents fully baked in. Okay, here we are at ChatGPT, but we don't want just regular ChatGPT, not even o1-preview, not even GPT-4o with canvas; we're going to create our own custom GPT. We do that by going to Explore GPTs, then in the top right you'll see Create. We give a name and a purpose to our custom GPT: PDP-11 Expert, an expert at PDP-11 stuff, and what it's going to do is answer questions for us about that stuff.

This next point is where we can upload the files that will form part of its knowledge. We click Upload Files, and I've got a folder called PDP with a bunch of documents that are already OCRed PDFs, a format it can handle, so we'll just upload them directly. It takes a few seconds to upload the files, but it goes surprisingly quick. Now that they're all uploaded, as soon as the button becomes available, we'll click Create, and I'll say anybody with the link can access it. PDP-11 Expert is the name, and we'll save it. Next we can click View GPT to actually use it.

Now I'll ask it a very specific question that will require it to do some research. It's searching its knowledge base; let's hope it finds something, because you kind of have to actually read the document, but it does have that information, so let's see what it comes up with. 8067: that's the correct part number, and that is how you wire-wrap the boards. That would have been really handy when I did this about two weeks ago. I mean, why read the manual when you can just ask ChatGPT?
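If you'd rather do the same thing programmatically than through the GPT builder UI, the rough analogue is an OpenAI Assistant with the file_search tool and your documents in a vector store. This is only a sketch: the exact method paths have shifted between openai-python versions, so treat the names below as assumptions and check the current SDK docs.

```python
# A hedged sketch of the programmatic analogue of a custom GPT: an OpenAI
# Assistant with the file_search tool reading manuals from a vector store.
# Method paths have moved between openai-python versions (vector stores
# left beta in newer SDKs), so treat these names as assumptions.
from openai import OpenAI

client = OpenAI()

store = client.beta.vector_stores.create(name="pdp11-manuals")
with open("pdp1134_manual.pdf", "rb") as f:
    client.beta.vector_stores.files.upload_and_poll(
        vector_store_id=store.id, file=f)

assistant = client.beta.assistants.create(
    name="PDP-11 Expert",
    model="gpt-4o",
    instructions="Answer questions about PDP-11 systems from the attached manuals.",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [store.id]}},
)
print(assistant.id)
```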
In a previous episode I showed you how to run Ollama and Open WebUI locally on your machine, so let's take a look at the steps needed to provide context documents to Ollama. Here we are on the local machine running Ollama and Open WebUI. I've clicked the plus to upload a file and selected the PDP-11/34 user manual, just as I did last time with ChatGPT. Now I'll ask it a different but equally specific question: what are the power requirements for the PDP-11/34? It cranks out an answer really quickly, and it is in fact the correct answer. At the bottom of the answer, you'll see that it referenced the context document; even better, we can click on that link and bring up the actual part of the document it's referencing. I find that incredibly handy when you're looking for a citation of where it got the information you're asking about.

The size of the context window itself can also be quite constraining; think of numbers like 4,000 tokens for a model like Llama 3. If the documents you're trying to provide exceed the size of that context window, clearly it's not going to work as well. For example, I have a set of a half dozen PDP-11 reference documents that I'd like to incorporate into my searches, and the context window is generally not large enough for that.
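If you want a quick sanity check on whether a document will even fit, you can count its tokens before you paste it in. Here's a small sketch with tiktoken; note that tiktoken implements OpenAI's tokenizers, and Llama models tokenize somewhat differently, so treat the count as a rough proxy rather than an exact figure.

```python
# A quick sanity check on whether a document fits a context window.
# tiktoken implements OpenAI's tokenizers; Llama tokenizes differently,
# so the count is an approximation, not an exact figure.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = open("pdp1134_manual.txt").read()
tokens = len(enc.encode(text))

CONTEXT_LIMIT = 4096          # assumed small context window
RESERVED_FOR_ANSWER = 500     # leave headroom for the model's reply
print(f"{tokens} tokens; fits: {tokens + RESERVED_FOR_ANSWER <= CONTEXT_LIMIT}")
```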
Our next step, then, is retrieval augmented generation, or RAG, a system designed to dynamically retrieve information from an external knowledge base or database. When the user asks a question, the system searches through a large repository of documents or data and pulls out the most relevant pieces. That information is then combined with the model's internal knowledge to produce a more accurate and contextually informed response. The key strength of RAG is its ability to efficiently handle large amounts of data, pulling only what's necessary at the moment of the query. This allows for a more scalable approach, especially when dealing with complex, evolving datasets, and the retrieval process ensures that only the most pertinent information is used, which makes RAG particularly useful in situations where accuracy and specificity are crucial.

In contrast, uploading documents into the context window is a more static approach. When documents are uploaded, their content is directly inserted into the model's input window, giving the model access to that information for generating responses; it's essentially as if you had typed the documents into the query window. While it's a straightforward way of providing additional information to the model, it can be inefficient for large or numerous documents: the model has to work with everything provided in the input, regardless of whether all of it is relevant to the specific question being asked.

So RAG's ability to dynamically retrieve information makes it a more efficient and scalable system. It can handle large datasets without overloading the model's memory or processing capacity, and the retrieval process allows the system to be more selective, bringing in only the most relevant pieces of information in response to your query. This also makes it more adaptable in situations where the information being accessed frequently changes, since RAG can always pull the latest updates from the source without needing any manual adjustments. Now, the PDP-11 isn't changing much these days, but to maintain accuracy on more current topics, the user would otherwise need to regularly upload newer, revised documents, which adds a layer of manual upkeep.

Ultimately, RAG offers a more dynamic and scalable approach, especially well suited to handling large and evolving knowledge bases, while uploading documents into the context window is better suited to smaller, more static datasets. RAG ensures that the model's responses are not only accurate but also timely, reflecting the most up-to-date information available: if you're pulling the data from a database directly, you're getting the very latest snapshot, and even with documents that are updated frequently, you're at least getting the most recent versions. Conversely, uploading the documents provides a simpler, though more limited, method of augmenting the model's knowledge.
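To see what's happening under the hood, here's a minimal sketch of the RAG retrieval step in Python using the ollama package: embed document chunks once, then at query time pull the closest chunks and prepend them to the prompt. The chunking is deliberately naive, the model names (nomic-embed-text, llama3.2) are just common Ollama models rather than requirements, and a real system would use a vector database instead of a linear scan.

```python
# A minimal sketch of RAG retrieval: embed fixed-size chunks once, then at
# query time rank them by cosine similarity and prepend the best matches.
# A production system would use proper chunking and a vector database.
import ollama
import numpy as np

text = open("pdp1134_manual.txt").read()
chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]

def embed(s):
    return np.array(ollama.embeddings(model="nomic-embed-text", prompt=s)["embedding"])

index = [(chunk, embed(chunk)) for chunk in chunks]  # built once, reused per query

def answer(question, k=3):
    q = embed(question)
    # Rank every chunk by cosine similarity to the question embedding.
    ranked = sorted(index, key=lambda pair: -float(
        np.dot(q, pair[1]) / (np.linalg.norm(q) * np.linalg.norm(pair[1]))))
    context = "\n---\n".join(chunk for chunk, _ in ranked[:k])
    reply = ollama.chat(model="llama3.2", messages=[
        {"role": "system", "content": "Answer from this context:\n" + context},
        {"role": "user", "content": question},
    ])
    return reply["message"]["content"]

print(answer("What are the power requirements for the PDP-11/34?"))
```

Open WebUI does essentially this for you behind the scenes, which is what we'll set up next.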
So let's take a look at setting up RAG on our own system with Ollama. One note about the configuration: I'm going to run Open WebUI locally this time, rather than in the Docker container, simply by enlisting in (cloning) it from GitHub and then launching the start script from the folder called backend. I'm doing it that way because it makes the documents folder easily available in my local file system. If you're running it in a Docker container it'll still work, but you'll need to use the docker cp command to copy your files into the container's version of the data documents folder.

To make our documents available to models within the Ollama and Open WebUI system, we need to go to the admin settings: we go to the admin panel, then Settings, then Documents. I've copied the documents I'm interested in into the documents folder underneath the Open WebUI folder I enlisted in, but it won't automatically find them until you come in here and scan. So let's click Scan and see what happens, monitoring the other window for progress. You can see it's pulling in the DEC manuals and so on; I can see stuff about cylinders and hard drives, so it's parsing and processing this data, and now it's ready. I can click Save, which is slightly obscured by a little dialog box that I can't get rid of for some reason, but I'll click Save, and those documents are now available to be used in creating custom models.
Now that we have our documents registered with the Open WebUI interface, we can create a new model that incorporates the knowledge encoded in our documents; when we subsequently query that model, it will use retrieval augmented generation to incorporate the knowledge into its answers. In the Open WebUI interface we go to Workspace, and once we're there we create a model. We give it a name and select a base model to work with; I'm actually going to go with the smaller but efficient Llama 3.2, because I want it to be more dependent on my data and less on the data it brings to the table. Whoops, I got those backwards; let me swap them. We'll put that there and type PDP-11 Expert in the description.

Now it's a simple matter of telling the model which documents to reference. The easiest way is to pick All Documents, and it will be able to reference everything in your data/documents folder; however, that will be slower, so in this case I'm just going to give it the one or two documents that apply directly to my actual machine, and we'll see if it can then incorporate that document knowledge into its answers using RAG.
Now, back at the main chat window, to interact with our model we pick it from the list; it's now available as the PDP Expert. Let's try a really specific question and see how it does. You can see in the left-hand window that it found and hit our data and referred to it to produce its answer. Ah, and because I probably didn't give it the information it needed on the memory board, the closest thing it could find was the RQDX3 controller, so it's going to answer with pretty much that as its main knowledge about the PDP-11. So let's ask it something it might actually know. Okay, that's a better answer: it's got the right jumper and it's found the right point in the manual. I don't know whether the manual has more detail than that or not, but it did find the right reference. So as we can see, it is in fact hitting all the RAG data, which you can also see scrolling by in the left-hand window as we do a query, and again it's able to synthesize and recite the facts it learned from the RAG data and give me at least approximately correct answers. And so, by creating a custom model with your documents embedded in it, it will use retrieval augmented generation as it answers your questions.

If you found today's little sampler of AI augmentation to be any combination of informative or entertaining, please remember that I'm mostly in this for the subs and likes, so I'd be honored if you'd consider subscribing to my channel and leaving a like on the video before you go today.
And if you're already subscribed, thank you. With that, you're now an expert on context windows and retrieval augmented generation; collect your official certificate on the way out the door. Be sure to check out the second channel, Dave's Attic, where you'll find our weekly podcast in which we answer viewer questions on episodes like this, live on the air. Well, not really live; it's recorded a day in advance, so almost live on the air. The podcasts go live every Friday afternoon and they're called Shop Talk, so take a look, maybe watch a back episode or two, and see how it goes. Thanks, and I'll see you next time.