This is going to be an intro course to everything you need to know about RAG, retrieval augmented generation. RAG is a really misunderstood topic, so I'm excited to share a little bit of knowledge with you about it. And thanks to Pinecone for not only sponsoring this video but being my exclusive channel partner; Pinecone has an incredible vector database product, which is the core of how RAG works, and I'll tell you more about that as we go through this video.

One of the major misunderstandings about RAG, and large language models in general, is how to give large language models additional knowledge. A lot of people think that's what fine-tuning is for: you fine-tune a model, give it a bunch of additional information, and then it just knows it from then on. But that's not really what fine-tuning is for, and in fact, what I've found is that nine out of ten times when you think you need fine-tuning, what you actually need is retrieval augmented generation. Rather than doing the more complicated task of fine-tuning a model, which is really for shaping how you want the model to respond to you in terms of tone, giving it additional knowledge is as simple as using RAG.

So first, let's start with an overview of RAG. As I mentioned, RAG means retrieval augmented generation: basically giving large language models an external source of information to augment your prompt. I like to think about RAG in two ways. One, it's an incredibly fast and efficient way to give your large language model additional knowledge. Two, it's an incredible way to give your large language model long-term memory, which it does not have out of the box.
You can think of large language models as being kind of frozen in time: once they're done training, they don't get any additional information unless you give it to them. There are a few ways to do that. Obviously, you can provide additional knowledge to your large language model through the prompt itself, and that's fine, but as soon as you need to do things at even a little bit of scale, that method breaks down quickly. There's a reason for that: the context window of a large language model is limited.

So what's a context window? A context window is the number of words, or tokens, that you can give a large language model in your prompt, combined with the number of tokens it outputs as a response back to you, and it is typically limited. For example, Llama 3 by default has an 8,000-token context window; even frontier models like GPT-4o have a 128,000-token context window. That may seem like a lot, but you actually use it up really quickly, especially when you're giving the model a ton of additional knowledge or storing a lot of its memory.
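To make that concrete, here's a rough sketch of how you might count the tokens a prompt uses before sending it. It assumes the tiktoken tokenizer library, and the file name is purely illustrative:

```python
# A rough sketch of how quickly a big document eats into a context window.
# Assumes the tiktoken library; the file name and model are illustrative.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")     # tokenizer GPT-4o uses
document = open("internal_docs.txt").read()     # assumed local file
prompt = "Answer this question using the documents below.\n\n" + document

print(len(enc.encode(prompt)), "tokens out of a 128,000-token window")
```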
And not only that: even as context windows continue to increase, which they are, stuffing them tends to be a really inefficient and costly way to continuously give new knowledge to a large language model. That's why we keep turning to retrieval augmented generation.

So let's think about an example. Let's say we're building a chatbot for customer service, and that chatbot needs to store the conversation between the customer and itself forever. That allows it to keep getting more personal and to understand that user better and better over time. Now, if we didn't have RAG, every single time there was a back-and-forth between the customer and the bot, we would have to feed that back into the prompt so the large language model knew the history of the conversation. You can imagine that after hundreds or thousands of back-and-forths, all of a sudden you're using up that entire context window every single time you submit a prompt.
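Here's a minimal sketch of that naive, no-RAG approach, just to show why it breaks down; the helper and prompt format are purely illustrative:

```python
# The naive approach: paste the entire chat history into every prompt,
# so the prompt grows with every single back-and-forth.
history = []  # list of (speaker, message) tuples, appended to after each turn

def build_prompt(user_message: str) -> str:
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return (
        f"Conversation so far:\n{transcript}\n\n"
        f"Customer: {user_message}\nAssistant:"
    )

# After thousands of turns, build_prompt() alone can blow past the context window.
```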
And here's the thing: most of that history might not even be relevant to what you're trying to get the large language model to respond to in that moment.

Let's use another example. Let's say we have internal company documents that we want the large language model to have knowledge of. First of all, when we're using this large language model, let's again just use GPT-4, it has no knowledge of my internal company documents, because the model wasn't trained on any of them, and it shouldn't be. But I want to give it access to that knowledge. If we were to take all of those documents and put them directly into the prompt every single time, along with whatever question we had about them, that might work. But as you can imagine, the majority of that context window gets used up every single time, and more than likely you're going to run out of it eventually if you have enough internal documents. So what do we do? That is when we use RAG.
And as I mentioned earlier, large language models are frozen in time, so if something new happens in the world and we want the model to have knowledge of that event, RAG is a really great way to do that.

All right, so let me talk about RAG from the highest level: what actually is it? You take information, let's say a document, and put it in external storage (I'll talk about what that actually looks like in a bit), and then you give the large language model the ability to query, or ask questions of, that document, and you can combine that with your prompt.

Let's use a simple example. Say Tesla just came out with their new earnings report. If we're using GPT-4 and its knowledge cutoff was months ago, it would have no knowledge of that new quarterly report from Tesla. But what we can do is take that quarterly report, store it in our RAG database, and then any time we ask a question about Tesla and their recent earnings, we add an intermediary step that goes and gets the relevant information from that document and appends it to our prompt. So our prompt would look something like: "Tell me what Tesla's earnings were in this latest quarter," and included in the prompt, underneath that question, is the relevant information from the 10-K document about their recent earnings. Now, instead of putting that whole 10-K document into the prompt every single time, we are only including the relevant information, based on what we need in that moment.
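Here's a minimal sketch of what building that augmented prompt could look like; retrieve_relevant_chunks() is a hypothetical stand-in for the vector-database lookup we'll get to shortly:

```python
# Hypothetical helper standing in for the vector-database lookup shown later;
# it just returns placeholder text so the sketch runs end to end.
def retrieve_relevant_chunks(question: str, top_k: int = 3) -> list[str]:
    return ["<relevant excerpt from the Tesla 10-K would go here>"]

def build_augmented_prompt(question: str) -> str:
    context = "\n\n".join(retrieve_relevant_chunks(question))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_augmented_prompt("What were Tesla's earnings in the latest quarter?"))
```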
This is a very simple example, but imagine you wanted the 10-K documents from the entire Fortune 500. All of a sudden you can see you're not going to include all of that in the prompt every single time, and you wouldn't want to either, because Tesla's revenue has nothing to do with Apple's revenue, which has nothing to do with Meta's revenue, and so on. You're asking about Tesla's revenue, so really the only thing we should append to our prompt is the knowledge about Tesla's revenue.

Let me give you another really specific example. Here's a workflow without RAG, and this is from the Pinecone learning documentation. We have a generative AI search, just asking a question to a large language model, and here's the question: "How do I turn off the automatic rear braking on the Volvo XC60?" That question gets sent as a prompt to the model, let's say ChatGPT, and since it doesn't have that exact information, it's going to hallucinate, meaning it makes something up: "Press the settings button on the center console or the steering wheel, use the buttons or the touch screen to navigate," and so on. But that isn't actually how it works, so there is a much better way of doing this.
So now let's look at what we need to do to actually prepare retrieval augmented generation for that same question. We take the entire Volvo user's manual, and here's the thing: our model doesn't currently have access to it, which is why it hallucinated. So we're going to take all of it, send it to an embedding model (I'll talk about what that is in a second), and then take those embeddings and put them into a vector database. Let me break all of this down for you.
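As a concrete sketch of that ingestion step, here's roughly what it could look like in code, assuming the OpenAI and Pinecone Python SDKs; the index name, file name, chunk size, and embedding model are all illustrative choices, not the only way to do it:

```python
# A minimal ingestion sketch: split the manual into chunks, embed each chunk,
# and store the embeddings (plus the original text) in a Pinecone index.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("volvo-manuals")  # assumes the index exists

manual_text = open("volvo_xc60_manual.txt").read()  # assumed local file
chunks = [manual_text[i:i + 1000] for i in range(0, len(manual_text), 1000)]  # naive chunking

for i, chunk in enumerate(chunks):
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=chunk
    ).data[0].embedding
    index.upsert(vectors=[{"id": f"chunk-{i}", "values": embedding,
                           "metadata": {"text": chunk}}])
```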
The Volvo user's manual is exactly that, let's say a PDF or a text document: just the entire manual for the Volvo cars. We send it to an embedding model, which converts it from text format into embedding format, and an embedding is just a series of numbers representing those words. In the simplest terms, that embedding places the natural language somewhere on a multi-dimensional graph. In this example we're using just two dimensions, so here it would be placed at (1, 1). And if we have a bunch of those, like in this example with lots of different points in the vector space, what ends up happening is that words, terms, and entire phrases that are similar to each other end up "close" to each other in that vector space. It's a really interesting and useful way to see whether words, phrases, and terms are related to each other. So look here: imagine this is part of the vector space, and we can see a cluster of words located near each other, such as connects, occupies, operates, operated, bounded, elevation. All of these words sit near each other in this vector space.
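Here's a toy illustration of what "close" means, using made-up two-dimensional embeddings like the ones in the diagram; real embeddings have hundreds or thousands of dimensions, but the math is the same:

```python
# Cosine similarity: vectors pointing in similar directions score near 1.0,
# unrelated ones score much lower. Coordinates here are invented for illustration.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

operates = np.array([0.90, 0.78])   # made-up 2-D embeddings
operated = np.array([0.88, 0.80])
earnings = np.array([-0.60, 0.25])

print(cosine_similarity(operates, operated))  # high: related terms sit close together
print(cosine_similarity(operates, earnings))  # lower: unrelated terms sit far apart
```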
Then we add an intermediary step before we actually prompt the large language model. The user sends a query, again the same one about the Volvo. It first goes to the embedding model, so the query itself gets converted into an embedding. Then we query the vector database: what's similar to this question about the Volvo? It grabs all the similar, relevant data, and we combine the original query with all of this additional context or knowledge, send it to the model, and actually get a relevant, accurate answer: "The driver can choose to deactivate Auto Brake with Rear Auto Brake and Cross Traffic Alert. The warning signal can be deactivated separately," and so on. Now we have non-hallucinated, accurate information, because we were able to give our large language model an external knowledge source.
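Here's a minimal sketch of that query-time flow, picking up from the ingestion sketch above; again, the SDKs, model names, and index name are assumptions for illustration:

```python
# Query-time RAG: embed the question, pull the closest chunks from the vector
# database, and send question + retrieved context to the chat model.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("volvo-manuals")

question = "How do I turn off automatic rear braking on the Volvo XC60?"

# 1. Convert the question into an embedding.
query_embedding = openai_client.embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding

# 2. Ask the vector database for the most similar chunks of the manual.
results = index.query(vector=query_embedding, top_k=3, include_metadata=True)
context = "\n\n".join(match.metadata["text"] for match in results.matches)

# 3. Prompt the model with the question plus the retrieved context.
answer = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)
```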
RAG is not only powerful with plain large language model calls; when you start to abstract things and use agents, it becomes even more powerful, and RAG is one of the main tools agents use to get all of this additional knowledge.

So let me show you this example. What we're looking at is a question: "Aside from the Apple Remote, what other device can control the program the Apple Remote was originally designed to interact with?" If you ask the LLM directly, it gives you a very basic, direct answer: iPod. But if we use agents with the power of RAG, we can allow the agents to iteratively come up with a plan, do the research, incorporate external knowledge sources, and then arrive at an even better answer.

Let's look at what that looks like. You can think of each of these "thoughts" as maybe different agents, though they can be structured in different ways. Thought one: "I need to search Apple Remote and find the program it was used for." So it searches Apple Remote, which might hit an external knowledge source detailing how the Apple Remote is used. Observation one: the Apple Remote was designed to control the Front Row media center. Thought two: "I need to search Front Row and find other devices that control it." So now it searches the Front Row documentation. Observation two: Front Row is software that was controlled by an Apple Remote or the keyboard function keys. "I now have the answer." Answer: keyboard function keys. So the agent was able to use a series of sophisticated steps with this external knowledge source.
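For what it's worth, here's a compact sketch of that think-search-observe loop; ask_llm() and search() are hypothetical helpers standing in for an LLM call and a retrieval call over the external knowledge source:

```python
# A ReAct-style loop: the model proposes a thought and an action, the action is
# run against an external knowledge source, and the observation is fed back in,
# until the model decides it can answer. ask_llm() and search() are hypothetical.
def react_agent(question: str, max_steps: int = 5) -> str:
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        thought, action = ask_llm(scratchpad)          # e.g. ("...", "Search[Front Row]")
        if action.startswith("Answer:"):
            return action.removeprefix("Answer:").strip()
        observation = search(action)                   # retrieval over external docs
        scratchpad += (f"Thought: {thought}\nAction: {action}\n"
                       f"Observation: {observation}\n")
    return "No answer found within the step limit"
```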
Now let's take a step back and look at what's actually happening in a little more detail. Remember we talked about putting the embeddings in a vector space? That's exactly what Pinecone is built for, but instead of one point with two dimensions, it can hold billions or even trillions of points across tons of dimensions. That's why Pinecone is so good and why vector storage is so good: it's lightning fast and you use natural language to query against it.

Now, if this sounds familiar, like search, that's because this is typically how search is done. When you search for something, you type out your query in natural language, your query gets plotted in that vector space like I mentioned, it looks for everything relevant to your query (relevant meaning close in the vector space), and it returns the results. That's how search works, and that's also how retrieval augmented generation works.

Pinecone makes it super easy to do this. In fact, as a developer you really don't need to know how any of this works under the hood to go and build something with it, and that was the case for me when I first started: I didn't really understand what RAG was or what a vector database was, but I was still able to build something with it. All you do is take your data, send it to an embedding model (and there are plenty to choose from), then send it to Pinecone and they'll store it for you. Then, when you're ready to create a query against it, you send them your query in embedding format and they return the results for you. There are a ton of different options, and what I will say is that Pinecone is known for scale: it is super fast and incredibly efficient at high scale.
By the way, if you want to see me do a full tutorial on how to actually set all of this up, from provisioning your Pinecone account, to taking your documents, converting them to embeddings, and storing them in the vector space, let me know in the comments. I'm happy to do that, and it's actually a lot simpler than you think.

So that's the basics of RAG for today. If you want to see me do a tutorial using Pinecone, let me know in the comments; if you want a deeper dive into RAG, also let me know in the comments. And again, a very special thank you to Pinecone for sponsoring this video and for partnering with me on the channel. I'm going to drop all of the links to Pinecone and everything else you need to know about RAG in the description below. If you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.