Python RAG Tutorial (with Local LLMs): AI For Your PDFs

pixegami
Learn how to build a RAG (Retrieval Augmented Generation) app in Python that can let you query/chat ...
Video Transcript:
In this video, we're going to build a Python RAG application that lets us ask questions about a set of PDFs we have using natural language. The PDFs I'm going to use here are a bunch of board game instruction manuals for games like Monopoly or Codenames. I can ask questions about my data, like "how do I build a hotel in Monopoly?"
The app will give me an answer and a reference to the source material. Now, I have done a basic RAG tutorial before on this channel, but in this video we're going to take it up a notch by introducing some more advanced features that you guys asked about in the comments last time. We're going to cover how to get it running locally on your computer using open source LLMs.
I'll also show you how to update the vector database with new entries. So if you want to modify or add information, you can do that without having to rebuild the entire database from scratch. Finally, we'll take a look at how we can test and evaluate the quality of our AI generated responses.
This way you can quickly validate your app whenever you make a change to the data source, the code or the LLM model. All right, let's get started. If you haven't built an app like this before, then I highly recommend you to check out my previous video tutorial on this topic first.
It will help you to get up to speed with all of the basic concepts. Otherwise, here's a quick recap. RAG stands for Retrieval Augmented Generation, and it's a way to index a data source so that we can combine it with an LLM.
This gives us an AI chat experience that can leverage that data. Here's a quick demo of the completed app. I have my Python script here and I'm going to ask a question about my data source, which is going to be a set of board game instruction manuals.
So I can ask, "how do I build a hotel in Monopoly?" And the result is that it gives me a response based on the data that it found in the PDF sources that I provided it. So the response is going to use that and actually phrase it into a proper natural language response.
It's not just going to copy and paste the raw data source. And here it's telling me that if I want to build a hotel, I need to have four houses in a single color and then I can buy the hotel from the bank. And in this version of the app, I'm also using a local LLM model to generate this response.
So here I have my Ollama server running in a separate terminal. If you don't know what that is yet, that's okay. We'll cover it later.
But here's the actual LLM reading the question and then turning this into a response. Here's a quick recap on how that all works behind the scenes. First, we have our original data source, the PDFs.
This data is going to be split into small chunks and then transformed into an embedding and stored inside of the vector database. Then when we want to ask a question, we'll also turn our query into an embedding. This will let us fetch the most relevant entries from the database.
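To make the retrieval step concrete, here is a minimal, library-free sketch. It assumes a toy "embedding" (plain word counts) and cosine similarity as the distance measure; both are stand-ins chosen for illustration, not what the app in the video uses, and the names `toy_embed` and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. A real app would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose toy embeddings are closest to the query's."""
    query_vec = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "To build a hotel you need four houses on each property of a color group.",
    "Players collect train cards to claim routes in Ticket to Ride.",
    "The spymaster gives a one-word clue in Codenames.",
]
best = top_k("How do I build a hotel?", chunks, k=1)
print(best)
```

A real embedding model captures meaning rather than exact word overlap, but the flow is the same: embed the chunks, embed the query, and keep the closest matches.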
We can then use those entries together in a prompt and that's how we get our final response. For this tutorial, we're going to mainly focus on the features I mentioned at the beginning of the video. But for everything else, we're going to be speeding through it a little bit.
So if you feel like it's all going a little bit too fast, you can either check out my previous RAG tutorial video first to learn the basics. Or you could also follow along by looking through the code itself on GitHub. Links will be in the description.
Here are the main dependencies I'll be using in this project. So go ahead and install or update them first before you start. First, we'll need some data to feed our RAG application with.
Gather some documents that you'd like to use as your source material. In my previous video, a lot of you asked me how to do this with PDFs. So I'm going to be using PDFs here.
I'm going to use board game instruction manuals. I've got one for Monopoly and I've also got one for Ticket to Ride. And I just found these for free online.
So you can use whatever you want, but this is what I'm going to use here. Just download the PDFs you want to use online and then put them inside a folder. In this case, I've put it inside this data folder here in my project.
This is the code I can then use to load the documents from inside that folder. It's using a PDF document loader that comes with the LangChain library. And for future reference, if you want to load other types of documents, you can head over to the LangChain documentation.
Look up document loaders and then just pick from any of the various available document loaders here. There's things for CSV files, a directory, HTML, Markdown and Microsoft Office. And if that's still not enough, you can click on the document loader integrations and there's a whole list of third-party document loaders available for you to choose from as well.
And if you want to see what one of these documents looks like after you've loaded it, you could just go ahead and print it out. You should see an object like this. So each document is basically an object containing the text content of each page in the PDF.
It also has some metadata attached, which tells you the page number and the source of the text. Our next problem is that each document or each page of the PDF is probably too big to use on its own. We'll need to split it into smaller chunks and we can use LangChain's built-in recursive text splitter to do exactly that.
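As a rough sketch of the load and split steps, the two functions below use LangChain's `PyPDFDirectoryLoader` and `RecursiveCharacterTextSplitter`. This assumes the `langchain-community`, `langchain-text-splitters` and `pypdf` packages are installed. The folder name `data` comes from the video, but the chunk size of 800 and overlap of 80 are illustrative assumptions, not values stated in it.

```python
def load_documents(data_path: str = "data"):
    """Load every PDF in a folder; each page becomes one Document."""
    # Imported lazily so this file can be read and imported without LangChain.
    from langchain_community.document_loaders import PyPDFDirectoryLoader

    return PyPDFDirectoryLoader(data_path).load()


def split_documents(documents, chunk_size: int = 800, chunk_overlap: int = 80):
    """Split page-sized Documents into smaller, slightly overlapping chunks."""
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # assumed value; tune for your documents
        chunk_overlap=chunk_overlap,  # assumed value; keeps context across chunk edges
        length_function=len,
    )
    return splitter.split_documents(documents)
```

Calling `split_documents(load_documents())` would return a list of chunks, each still carrying the `source` and `page` metadata of the page it came from.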
After you run that on your documents, you'll find that each chunk is a lot smaller. So this is going to be handy when we index and store the data. Next, we'll need to create an embedding for each chunk.
This will become something like a key for a database. I actually recommend creating a function that returns an embedding function because we're actually going to need this embedding function in two separate places. The first is going to be when we create the database itself.
And the second is when we actually want to query the database. And it's very important that we use the exact same embedding function in both of these places. Otherwise, it's not going to work.
LangChain also comes with a lot of different embedding functions you can use. In this case, I'm using AWS Bedrock because I tend to build a lot of stuff using AWS already. And the results are pretty good, from what I can tell.
But you can switch to using a different embedding function as well. You can choose from any of the embedding integrations listed here on the LangChain website. For example, if you want to run it completely locally on your own computer, you can use an Ollama embedding instead.
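One way to keep the embedding choice in a single place is a small factory function, sketched below. It assumes the `langchain-community` package; the `provider` parameter, the Ollama model name `nomic-embed-text`, and the AWS profile and region values are illustrative assumptions rather than settings taken from the video.

```python
def get_embedding_function(provider: str = "ollama"):
    """Return the one embedding object used for both indexing and querying."""
    if provider == "ollama":
        # Needs a running Ollama server with the named model pulled.
        from langchain_community.embeddings import OllamaEmbeddings

        return OllamaEmbeddings(model="nomic-embed-text")  # assumed model name
    if provider == "bedrock":
        # Needs AWS credentials configured for the named profile.
        from langchain_community.embeddings import BedrockEmbeddings

        return BedrockEmbeddings(
            credentials_profile_name="default",  # assumed profile
            region_name="us-east-1",             # assumed region
        )
    raise ValueError(f"Unknown embedding provider: {provider!r}")
```

Because both the database builder and the query script call the same function, switching providers is a one-line change. The existing database still has to be rebuilt after a switch, since vectors from different models are not comparable.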
Of course, for this to work, you also need to install Ollama and run the Ollama server on your computer first. If you haven't used Ollama before, you can think of it as a platform that manages and runs open source LLMs locally on your computer. Just download it from the official website, ollama.com, and then install any of the available open source models like Llama 2 or Mistral.
You can then run this command to serve the model as a REST API on your localhost. Now, you'll be able to use an LLM just by calling this local API.
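For a sense of what calling the local API looks like, here is a small standard-library sketch. It assumes Ollama's default address (`http://localhost:11434`) and its `/api/generate` endpoint, and that a model has already been pulled and the server started (typically with `ollama pull mistral` and `ollama serve`). The function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_generate_request(prompt: str, model: str = "mistral") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "mistral") -> str:
    """Send the request; only works while the Ollama server is running."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("Say hello in five words.")` would return the model's text.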
Of course, the LangChain module for Ollama embeddings will handle all of this for you as long as the server is running. However, just as a heads up, for my own testing using one of the 4GB models on Ollama, the embedding results just weren't very good. For RAG apps, having good embeddings is essential, otherwise your queries won't match up with the chunks of information that are actually relevant.
So for myself on this project, I'm still going to use a service like OpenAI or AWS Bedrock for the embeddings. But if your computer can handle it, you can try using a larger, more powerful model on Ollama as well, and please let me know how that goes. By the way, some of you might be wondering at this point, how did I measure the quality of the embeddings?
Well, we'll get to that later when we look at testing. Now let's walk through the process of creating the database. Once we have the documents split into smaller chunks, we can use the embedding function to build a vector database with it.
So just as a quick recap, a vector is something like a list of numbers, and our embeddings are actually a vector because they're just a list of numbers. So a vector database lets us store information using vectors as something like a key. And in this video, we're going to be using ChromaDB as our vector database.
In my first video, we actually had code that looked a lot like this, and it's useful if we wanted to create a brand new database from scratch. But what if we wanted to add or update items in an existing database? ChromaDB will let us do this too, but first we'll need to tag every item with a string id.
Let's go back to our chunk of text and figure out how we can do this. So as you can see, each chunk already has its source file path and a page number. So what if we put it together to do something like this?
We'll use the source path, the page number, and then the chunk number of that page. Because remember, a single page could have several chunks. That way, every chunk will have a unique but deterministic id.
We can then use this to see if this particular chunk exists in the database already, and if it doesn't, then we can add it. Implementing this is pretty easy as well. We can loop through all the chunks and look at each one's metadata.
We'll concatenate the source and the page number to make an id. But because a single page is split up into multiple chunks, we actually have many chunks sharing the same page id. Solving this is pretty easy though.
We can just keep count of the chunk index for a page, and then reset it to zero whenever we see a new page. So putting all that together, we now have a chunk id that looks something like these. Each chunk is now guaranteed a unique and deterministic id.
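The counting logic described above can be sketched as follows. The `Chunk` dataclass is a stand-in for a LangChain document (anything with a `metadata` dict works), and the colon-separated `source:page:index` format is an assumed layout for the ID. The sketch also assumes chunks from the same page arrive next to each other, which is the case when they come straight out of the splitter.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def assign_chunk_ids(chunks):
    """Tag each chunk with 'source:page:index', restarting the index on every new page."""
    last_page_id = None
    chunk_index = 0
    for chunk in chunks:
        page_id = f"{chunk.metadata.get('source')}:{chunk.metadata.get('page')}"
        if page_id == last_page_id:
            chunk_index += 1   # another chunk from the same page
        else:
            chunk_index = 0    # first chunk of a new page
        chunk.metadata["id"] = f"{page_id}:{chunk_index}"
        last_page_id = page_id
    return chunks


chunks = assign_chunk_ids([
    Chunk("...", {"source": "data/monopoly.pdf", "page": 6}),
    Chunk("...", {"source": "data/monopoly.pdf", "page": 6}),
    Chunk("...", {"source": "data/monopoly.pdf", "page": 7}),
])
ids = [chunk.metadata["id"] for chunk in chunks]
# ids == ["data/monopoly.pdf:6:0", "data/monopoly.pdf:6:1", "data/monopoly.pdf:7:0"]
```

Running the same function on the same files always produces the same IDs, which is what makes the later "is this already in the database?" check possible.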
Let's add it back into the metadata of the chunk as well so we can use it later. Now, if we add new PDFs or add new pages to an existing PDF, our system will have a way to check whether it's already in the database or not. So let's hop over to the code editor and see this in action.
Currently, in my data folder, I've got a Monopoly PDF and a Ticket to Ride PDF. So now I'm going to add a new PDF to this folder. It's going to be the one for Codenames.
This is the one I'm adding. So now when I populate the database, I want my program to detect that this one is new, but the other two already exist. So I only want this one to be added.
So here, right away, it's quickly detected that there's 41 documents already inside the database, but we have 27 new documents that we need to add just because I moved that new PDF into the data directory as well. So that was a new one. And this time, even if we run the same command again to populate the database, it can see that all the documents, all the PDFs inside that data folder have already been added from the previous step and there was nothing new to add.
So this is exactly the behavior that we want. Although this implementation will let us add new data without having to recreate the entire database itself, it's actually not enough for us if we wanted to edit an existing page. For example, if I modify the PDF content in this chunk here, the chunk ID will still be exactly the same.
So how do we know when we need to actually update this page? This problem is out of scope for today, but there's actually many ways to solve this. If you think you know the solution, then please share it in the comments.
Now let's close the loop on this and actually take a look at the code that you need for updating your database. Now that we've given every chunk a unique ID, let's add them to the database. If you're using Chroma, you can first load up your database like this, using the same embedding function we used earlier.
Let's go through all the items in the database and get all of the IDs. If you're running this for the very first time, then this should be an empty set. After that, we can filter through all of the chunks we're about to add.
If we don't see an ID inside the set, that means it's a new chunk and we should add it. From there, it's all pretty easy. It's just a few lines to add the documents to the database.
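A minimal sketch of that "only add what's new" step is below. `InMemoryStore` is a tiny stand-in so the example runs offline; it mimics the two calls the sketch relies on, `get(include=[])` returning a dict with an `"ids"` list and `add_documents(documents, ids=...)`, which LangChain's Chroma wrapper also provides. The function name `add_new_chunks` is hypothetical.

```python
from types import SimpleNamespace


class InMemoryStore:
    """Stand-in for the vector store, exposing only the two calls used below."""

    def __init__(self):
        self._docs = {}

    def get(self, include=None):
        return {"ids": list(self._docs)}

    def add_documents(self, documents, ids):
        for doc, doc_id in zip(documents, ids):
            self._docs[doc_id] = doc


def add_new_chunks(db, chunks) -> int:
    """Add only chunks whose metadata['id'] is not stored yet; return how many were added."""
    existing_ids = set(db.get(include=[])["ids"])  # empty set on the very first run
    new_chunks = [c for c in chunks if c.metadata["id"] not in existing_ids]
    if new_chunks:
        # Pass the IDs explicitly so the store does not generate its own.
        db.add_documents(new_chunks, ids=[c.metadata["id"] for c in new_chunks])
    return len(new_chunks)


db = InMemoryStore()
chunks = [
    SimpleNamespace(page_content="...", metadata={"id": "data/monopoly.pdf:0:0"}),
    SimpleNamespace(page_content="...", metadata={"id": "data/monopoly.pdf:0:1"}),
]
first_run = add_new_chunks(db, chunks)   # both chunks are new
second_run = add_new_chunks(db, chunks)  # nothing left to add
```

With a real Chroma database in place of `InMemoryStore`, the same function body should apply unchanged.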
Just don't forget to also add the IDs explicitly as well. If you don't specify a matching list of IDs for the items that you're adding, then Chroma will generate new UUIDs for us automatically. It's convenient, but it also means that we won't be able to check for the existing items like we did earlier.
So if that's the case, when we try to add new items, we're just going to end up with a lot of duplicated items inside the database. Now let's put all this together and make this not just functional, but also able to run locally as well. If you were using Ollama's local embeddings from before, you'll be able to do everything 100% locally, end to end.
Or you might end up with more of a hybrid approach like me. I use an online embedding model because it's better than what I can do locally. But I found that as long as the embeddings are good, I can actually get pretty impressive results using a local LLM to do the actual chat interface.
So that's what we're going to do here. We can start by creating a new Python script or function that will take our query as input. We'll also have to load the embedding function and the database.
We'll need to prepare a prompt for our LLM. Here's the template I'm going to use. There's two variables we'll need to replace here.
First is the context, which is going to be all the chunks from our database that best match the query. And then second is the actual question that we want to ask. So we'll put that whole thing together and then we get the final prompt that we want to send to our LLM.
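Here is a small sketch of a template with those two variables. The exact wording of the template and the `---` separator between chunks are assumptions for illustration; only the two placeholders, `context` and `question`, come from the description above.

```python
PROMPT_TEMPLATE = """Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""


def build_prompt(context_chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and fill both template variables."""
    context = "\n\n---\n\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    ["You may buy a hotel once you own four houses on each property of a color group."],
    "How do I build a hotel in Monopoly?",
)
print(prompt)
```

Putting the question after the context means the last thing the model reads is the thing it has to answer.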
To retrieve the relevant context, we'll need to search the database, which will give us a list of the top K most relevant chunks to our question. Then we can use that together with the original question text to generate the prompt. If you decide to print out the entire prompt at this stage, you should see something like this.
So you've got your entire prompt template here, but you could see that our context section already has some of the chunks from the instruction manual formatted in. And I put my k=5, so there's actually five different chunks. And this is all part of one big prompt.
This is the information that my system thought was the best matching to answer our query. And then I kind of reiterate the question that I want right at the end after I've given all of this context. So here the question is, "how many clues can I give in Codenames?"
And the response is, "In Codenames you can only give one clue per turn, and the clue should be a single word." And then I also have the sources of this answer cited here, so that's basically where all these chunks were found. After you have the prompt, the rest is super easy.
All you have to do is just invoke an LLM with the prompt. Here I'll use the Mistral model on my local Ollama server. It only needs four gigabytes to run, but it's actually quite capable.
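The whole query step can be sketched as one function that takes the database and the LLM as arguments. It assumes `db` offers `similarity_search_with_score(query, k=...)` returning `(document, score)` pairs and `llm` offers `invoke(prompt)` returning text, as LangChain's Chroma and Ollama wrappers do. The prompt wording and the function name `query_rag` are illustrative.

```python
def query_rag(query_text: str, db, llm, k: int = 5):
    """Retrieve the top-k chunks, prompt the LLM, and return (answer, source ids)."""
    results = db.similarity_search_with_score(query_text, k=k)
    context = "\n\n---\n\n".join(doc.page_content for doc, _score in results)
    prompt = (
        f"Answer using only this context:\n\n{context}\n\n---\n\n"
        f"Question: {query_text}"
    )
    response_text = llm.invoke(prompt)
    # Each chunk's ID doubles as a citation back to the file, page and chunk.
    sources = [doc.metadata.get("id") for doc, _score in results]
    return response_text, sources


# Possible wiring with real components (assumes langchain-community is installed,
# an Ollama server is running, and my_embeddings is the same embedding function
# that was used to build the database):
#
#   from langchain_community.vectorstores import Chroma
#   from langchain_community.llms import Ollama
#   db = Chroma(persist_directory="chroma", embedding_function=my_embeddings)
#   answer, sources = query_rag("How do I get out of jail?", db, Ollama(model="mistral"))
```

Passing `db` and `llm` in as arguments also makes it easy to swap in fakes when testing the function.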
And if you want, you can also get the original source of the text like this. Now let's go back to our terminal and see this in action. So I'm going to use this program and I'm going to query it.
"How do I get out of jail in Monopoly?" And now the program stopped running, so let's go and see what it did. Here you can see that we find all the relevant chunks.
So this one is the most relevant, and it's actually spot on. It actually gives us step-by-step instructions on how to get out of jail. So I think really this is the only one we need.
But anyways, we put our limit to five, so we also get a bunch of other chunks that may be relevant to the question. And then as part of the prompt, we reiterate the question again so that our LLM knows what to answer. And using all of that information, this is the response our LLM came up with.
So it came up with four different ways we can get out of jail in Monopoly. And then right at the end, we also have the sources of all of this information. So that's what it's like when we run the entire application.
And even though I used AWS Bedrock for the embeddings, because I couldn't get local embeddings that were good enough, this part to generate the answer still uses a local Ollama server. So if I go to my other terminal here, see where my Ollama server is running, you could see it logging the work that we're doing. We now have a RAG application that works quite well end-to-end.
We can get it to answer our questions by using the embedded source material, but the quality of the answers we get would depend on quite a lot of different factors. For example, it could depend on the source material itself, or the way we split the text. And it will also 100% depend on the LLM model we use for the embedding and the final response.
So the problem we have now is, how do we evaluate the quality of responses? This seems to be a subjective matter. Let's see if we can approach this with unit testing.
If you've never worked with unit tests in Python before, then you can also check out my other video on how to get started with pytest. The main idea here is to write some sample questions and also provide the expected answer for each of those questions. So given a question like, "How much total money does a player start with in Monopoly?", the answer I'd expect my RAG application to respond with is 1500.
You want it to be something that you can already validate or already know the answer for. We can then run the test by passing the question into our actual app, and then comparing and asserting that the answer matches.
But the challenge with this is that we can't do a strict equality comparison, because there could be many ways to express the right answer. So what we can do instead is actually use an LLM to judge the answer for us. This won't always guarantee perfect results, but it does get us pretty close.
We can start by having a prompt template like this, that asks the LLM to judge whether these responses are equivalent. Then, as part of our test, we'll query the RAG app with our question, and then we'll create a prompt based on the question, the expected response, and the actual response. We can then invoke our LLM again to give us its opinion.
We can clean up the response we get from that, and finally check whether the answer is true or false. And this is something we'll actually be able to assert on as part of our unit test. So putting all that together, I can wrap this into a nice helper function that returns true or false.
Then, I can just write a bunch of unit tests using that helper function, and I can write as many test cases as I want. This will give me a quick way to see how well my application is performing, especially after I make updates to the code, the source documents, or the LLM model itself. Now let's hop over back to our editor to do a quick demo.
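A sketch of such a helper is shown below. The evaluation prompt wording is an assumption, and the app and the judge are passed in as plain callables (`ask_app`, `ask_judge`, both hypothetical names) so the logic can be run without any model.

```python
EVAL_PROMPT = """Expected Response: {expected_response}
Actual Response: {actual_response}
---
(Answer with 'true' or 'false') Does the actual response match the expected response?
"""


def parse_verdict(judge_output: str) -> bool:
    """Turn the judge's free-text reply into a boolean."""
    cleaned = judge_output.strip().lower()
    # 'true' is checked first, so a verbose reply containing both words is read
    # as True; a stricter parser may be preferable for chatty models.
    if "true" in cleaned:
        return True
    if "false" in cleaned:
        return False
    raise ValueError(f"Judge returned neither 'true' nor 'false': {judge_output!r}")


def query_and_validate(question, expected_response, ask_app, ask_judge) -> bool:
    """Ask the RAG app, then ask an LLM judge whether the answer matches."""
    actual_response = ask_app(question)
    prompt = EVAL_PROMPT.format(
        expected_response=expected_response,
        actual_response=actual_response,
    )
    return parse_verdict(ask_judge(prompt))
```

In a real test file, `ask_app` would be the RAG query function and `ask_judge` would invoke the LLM, so each test case shrinks to a single `assert query_and_validate(...)` line.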
So I've got my test file here, and here is the helper function that you saw earlier, and here is us trying to interpret that result into either a true or a false result, and here is the prompt template. So these are going to be my two test cases. I'm going to test the Monopoly rules, and I'm also going to test the Ticket to Ride rules.
So two test cases. Let's see how it does. Okay, and in this case, both of my test cases passed.
Let's expand this window and actually take a bit of a closer look. So here, my expected response is 10 points, and the actual response is "The longest continual train gets a bonus of 10 points." So these are not exactly the same string, but they're still saying the same thing.
And this is true. So this was successful. And then if I go up to my Monopoly one, the expected response is 1,500, and the actual response is also 1,500.
And as you can see again, the format is slightly different, so we need the LLM to tell us whether or not these actually mean the same thing. So this one passed as well. In this case, both of our tests passed.
Now, we have to be careful with this, because we don't know whether it passed because the evaluation was good and the answer was correct, or because our LLM judge is too generous and would pass wrong answers as well. So it's also good to do a negative test case to kind of check that. So what we could do is we can turn this expected response into something we know is wrong and then check that it actually fails.
We want it to fail in that case. So I'm going to put 9999. Okay, and I'm now running that test again, expecting this case to fail.
And here it actually does fail, which is good. That's exactly what we wanted. So we have our fake expected response of 9999, and then the actual response is still the same from when we asked it before, which is 1,500.
And our LLM evaluation correctly determines that this is the wrong response. So our test will fail in this case, and our entire test suite will fail. However, if we want a failing test, if we want this negative case to be used as part of our suite in the correct way, what we could actually do is go back to our test case here and then invert the assertion.
So instead of asserting that this is true, we can assert that this is actually going to fail. And that also tells us that this answer should be wrong, and something is wrong if it's not wrong, if that makes sense. So let's go ahead and run this again.
So this time the LLM still believes that the response doesn't match, and it's false. But because we've inverted the assert case, the entire test suite still manages to pass. So I recommend that if you're going to write tests for LLM applications like this, it's good to have both positive cases and negative cases being tested.
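The positive and negative pair can be sketched in pytest style as below. To keep the example runnable offline, `answers_match` is a deliberately crude stand-in for the LLM judge (it only compares the digits in both strings), and `ACTUAL` is a hypothetical app response. In a real suite both would come from the RAG app and the judge model.

```python
def answers_match(expected: str, actual: str) -> bool:
    """Offline stand-in for the LLM judge: compares only the digits in each string."""
    def digits(text: str) -> str:
        return "".join(ch for ch in text if ch.isdigit())

    return digits(expected) == digits(actual)


ACTUAL = "Each player starts with $1,500."  # hypothetical app output


def test_monopoly_starting_money():
    # Positive case: the correct expectation should be judged a match.
    assert answers_match("$1500", ACTUAL)


def test_monopoly_starting_money_negative():
    # Negative case with an inverted assertion: a known-wrong expectation
    # must NOT be judged a match, otherwise the judge is too generous.
    assert not answers_match("$9999", ACTUAL)
```

Run with `pytest`, both tests pass. The second one fails only if the judge starts accepting wrong answers.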
And by the way, if you do have a lot of different test cases you want to use, you maybe don't need to assert that 100% of them succeed. You could maybe set a threshold for what is good enough. For example, 80% or 90%.
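A threshold check like that could be sketched as follows. The function names and the list of results are hypothetical, and 0.8 simply mirrors the 80% example.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of test cases that passed (0.0 for an empty list)."""
    return sum(results) / len(results) if results else 0.0


def meets_threshold(results: list[bool], threshold: float = 0.8) -> bool:
    """True when the pass rate is at least the threshold."""
    return pass_rate(results) >= threshold


results = [True, True, True, True, False]  # hypothetical outcomes of five cases
print(pass_rate(results), meets_threshold(results), meets_threshold(results, 0.9))
```

A single assertion on `meets_threshold(results)` can then replace one assertion per case when occasional LLM misjudgments are acceptable.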
So now you've leveled up your project by learning how to use different LLMs, including a local one, and you've also learned how to add new items to your database, and how to test the quality of your application as a whole. These were all topics that were brought up in the comment section of my previous RAG tutorial. And so after watching this, if there's more things you'd like to learn how to do, like deploy this to the cloud for example, then let me know in the comments of this video and we can build it together in the next one.
I know we went through the project quite quickly. My focus here was to show you the coding snippets that mattered the most and helping you to understand them. So I've actually had to simplify a bunch of the code and the ideas along the way.
But if you want to take a closer look and see how all the pieces fit together into a project, or you just want to download code that you can run right away, then check out the GitHub link in the video description. There you'll have access to the entire project that I used for this video, and something that I was running end-to-end as you saw in the demo here. Anyways, I hope this was useful, and I'll see you in the next one.