As amazing as AI models are, they do have some weaknesses. Sometimes they hallucinate, and you can't just tell them not to, because they don't always know that they're hallucinating. If you get a good answer to a question one time, you can't guarantee you'll get a good answer to that same question the very next time. And if you ask about something that happened in the last few weeks or months, or even a year, there's a good chance the model won't know anything about it. The reason for this is that models take a while to train; it can take months to get them up to the level they're at, and that means the last bit of knowledge they have is from at least several months ago. A fine-tune might be a little more recent, but then again, fine-tunes don't really add a lot of recent facts, and when you do get them to learn new facts, they tend to get slower. So how can you get a model to know about more recent events? There are a few approaches you can take, and in this session of the free Ollama course we're going to take a look at one of them.

Welcome back to the Ollama course. This is a free course, available here on this YouTube channel, that will teach you everything you need to know about using Ollama to run artificial intelligence models locally on your laptop, on your computer, or even on an instance way up there in the cloud. So far in this course we've mostly covered how to use the basic functionality of Ollama and how to find models, as well as the fact that size doesn't always matter. Now we move on to one of the most common techniques for getting more recent info into your model; it's also useful for getting it to know about special techniques and processes only you or your company knows about. It's called RAG, which stands for retrieval-augmented generation: the model generates an answer, and it's augmented with what it can retrieve from a collection of documents. But there are a few steps you need to take to prepare for the prompt. First, consider the type of documents that you're going to be working with.
The best types are documents you can get the pure text from without any obfuscation. PDF was created to make it easier to print a document without having to worry about fonts downloaded to your printer. Yes, a long time ago we had to buy printers with fonts loaded on them, you could load more as well, and the only fonts a printer could print were the ones it actually knew about. PDF was pretty revolutionary, making it easy to get a printed page that looks exactly the same as what's on the screen. But getting readable, sensible text out of it for a computer has never been the goal of PDF, and sometimes the format is used to ensure that that cannot happen. So if you're working with PDF, take a few extra minutes to get access to the source document. In 99% of the cases that I've seen, the source text is available, though it sometimes requires an email to the author or publisher. You probably don't want to go through that hassle, but getting clean text out of the PDF is often even more of a hassle. If you are in that 1%, and you probably aren't, but if you are, there are some tools that do a fair job at attempting to get clean text out of a PDF. Tools like pypdf or PyMuPDF and others can do a decent job on some docs, but fail to get good, clean text on many others until you tweak them for each document. In fact, pypdf's docs say it's the PDF author's responsibility to make it easier for pypdf to work, which seems like a bit of a cop-out.
The reason all of this is important is that if the text you get is jumbled, you aren't going to get great results from your RAG system. Now you might wonder how so many tools out there seem to get it right: well, sometimes they get lucky, and sometimes they don't, because PDF is hard.

But let's assume you've loaded up a doc and gotten the text out of it. Now we need to store that content somewhere, but we can't store it as one contiguous stream of text. There are a few reasons for this. The first is context size: even if the document fits in the context, having a conversation with the document may fill the context later on, and if most of the document doesn't help answer the question, why even include it? Second, sending too much data to the model, regardless of the context size, can confuse it. Not everyone is a great writer, and opposing viewpoints in one doc, or across a collection of docs, can have ill effects on the final output; sometimes it helps, but usually it doesn't.

So we want to split up, or chunk, the text. This is an area with a lot of options and no one best solution. Some like to split the text by pure character count, others by tokens, or maybe words, or even paragraphs or sections of the document. You may also want some overlap that adjacent chunks share.
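To make that concrete, here's a minimal sketch of the simplest option mentioned above: splitting by pure character count with overlap. The sizes are illustrative, not recommendations from the course.

```python
# Minimal fixed-size chunking with overlap. Real systems often split on
# tokens, sentences, or document sections instead; chunk_size and overlap
# here are just illustrative values.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size characters,
    where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "word " * 100  # stand-in for real document text (500 characters)
chunks = chunk_text(doc, chunk_size=40, overlap=10)
print(len(chunks))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is the usual reason people add it.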
Now let's say we have our chunks of text; we need a way to find the most relevant ones for a given prompt, and this is where embeddings come in. Embeddings are a numerical representation of our text: we pass a chunk through a special embedding model, which figures out the semantic meaning of the text and represents it as a vector in a space with many dimensions. Exactly how many dimensions depends on the model. All text that is embedded ends up as a vector of a fixed length, and different embedding models produce vectors of different lengths, so if you change the embedding model later on, you need to re-embed all the chunks in your system. The magic of embeddings is that if the vectors are the same length, it's mathematically very easy to compare them and find the chunks closest to the meaning of the original prompt.
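As an illustration of that comparison, here's a sketch using cosine similarity, one of the common choices. The tiny four-dimensional vectors below are made up for the example; real embedding models produce hundreds or thousands of dimensions.

```python
# Comparing same-length embedding vectors with cosine similarity.
# The vectors here are invented for illustration, not real embeddings.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

prompt_vec = [0.9, 0.1, 0.0, 0.2]  # pretend embedding of the question
chunk_vecs = {
    "chunk about revenue": [0.8, 0.2, 0.1, 0.3],
    "chunk about weather": [0.0, 0.9, 0.7, 0.1],
}

# Rank chunks by similarity to the prompt, closest first.
ranked = sorted(chunk_vecs,
                key=lambda k: cosine_similarity(prompt_vec, chunk_vecs[k]),
                reverse=True)
print(ranked[0])
```

A vector compared with itself scores 1.0, and the chunk whose vector points in nearly the same direction as the prompt's vector ranks first.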
There are a few algorithms used for this comparison, and depending on how you store the chunks, you may have more or fewer choices. So how do you store the chunks? Well, you could store them in a single file on your system, or as separate files, but as soon as you get into the thousands of chunks, you run into scalability issues. It's going to be far more efficient to store them in what's called a vector database. There are a lot of options here; last I saw there were 50-plus, with some of the most popular being Chroma, Pinecone, and Milvus, plus add-ons to your favorite relational databases, like Postgres.

So at this point you have a vector store with chunks stored in it. For each chunk you have the embedding stored, but you also need to store the source text. The reason for this is that we use the embedding to find the most similar text, and then we provide the raw text to the model in the prompt. It is possible to recreate the text from the embedding most of the time, but it's just more efficient to store the text with the chunk.
Now you need to build the prompt. You probably started with a question like "Explain something from this quarterly financial report." You create an embedding of that question and query the database for the most similar chunks. It can potentially return a lot of results, each with a ranking of how close it is, so you choose the top five or ten results and grab their text. Then you add something like "Use the following information to answer the question," include the text from each chunk, and pass that whole string to the model in your prompt.
The rest is the same as every other interaction with a model: models just understand text in the input prompt. The goal of RAG is to find the relevant chunks of text and include them in the prompt, which essentially lets the model know about things it didn't really know before. So in this video, what we saw is that RAG is a way to get relevant text into the prompt so that a model can know about specific information on a topic.
It gets that text by looking in a vector database for chunks of text, fragments of the source documents, that are semantically similar to the question in the prompt. Those chunks are stored in the database as raw text plus an embedding, which is a numerical vector generated by a special embedding model. In upcoming videos we'll look at the components of RAG in a bit more detail and actually start using some of them. We'll also build a complete RAG system, as well as try out some of the existing tools that have RAG built in; there are some great options out there, including Open WebUI, Msty, and Page Assist. But this video is long enough just talking about the high-level concepts. Thanks so much for watching this video in the course, and I hope you have a great time making AI part of your life using Ollama. Goodbye.