Build high-performance RAG using just PostgreSQL (Full Tutorial)
Dave Ebbelaar
Want to get started with freelancing? Let me help: https://www.datalumina.com/data-freelancer
Video Transcript:
In this video we're going to build a high-performance RAG solution for your AI applications using pgvectorscale and Python. I'm going to walk you through this entire repository, which I will also make available for you, showing you all the steps on how you can set this up. All you need to follow along is a Docker installation, Python, an OpenAI API key, and something to view your PostgreSQL database in. If you follow all the steps in this video, then at the end you will have your very own fully open-source database that you can use to store your vectors in, which we can then use with our RAG system to ask questions, retrieve the most relevant results, and use those results with an AI to get answers to our questions.

If you don't know who I am: my name is Dave Ebbelaar, I'm the founder of Datalumina, I've been building custom data and AI solutions for the past five years, and I also share educational videos like this here on YouTube to help you do the same and ultimately start freelancing.

All right, let's get into things. This is going to be a really in-depth, hands-on tutorial where the main goal is to show you, and help you understand, how you can set up pgvectorscale and use it to turn PostgreSQL into a vector database. This is what we currently use at Datalumina, and by far our preferred method right now; I will explain why that is the case and show you how you can set it up. Next to setting up that infrastructure, getting your database ready and putting your data into it, we also have this entire repository with the app folder, where we're going to focus on two main files. We have insert_vectors.py, which shows how you can take an FAQ dataset, vectorize it, and put it into the database, and then we have another file, which I showed in the intro, where we perform the similarity search and synthesize the results using AI. These will be the two main Python files we work from, but behind the scenes there's actually a lot more: there is an LLM factory that we will get into, there is a synthesizer with some prompts in it, and we have the whole database class that we can use to interact with the database. So it's going to be a lot, and you'll have everything you need to build your own high-performance RAG solutions with pgvectorscale and Python.

Before we get into the setup, some quick context, because it took me some time to figure all of this out. Feel free to skip over this part if you don't care about the theory behind it, but it's important to understand. A quick recap of why you would use PostgreSQL, with pgvectorscale, as a vector database over a dedicated vector database like Pinecone, Weaviate, or Qdrant: it's basically because of simplicity. With this approach you can use one database to manage your regular data, which could be chats, messages, or the content that you generate, and store that within your PostgreSQL database, and next to that you can have another table in that same database where you store the embeddings for your RAG system. Overall this just simplifies your deployment and your APIs. Two years ago we started out with Pinecone, and we quite quickly came across pgvector, which I also made a video about, already eight months ago now.
pgvector is the extension that enables a regular PostgreSQL database to be used as a dedicated vector database, and by then it had already been shown that in most cases the speed and latency are better than, or at least just as good as, a fully dedicated vector database, making a dedicated one pretty much redundant, or overkill. Fast forward to 2024, and we also have pgvectorscale, from a company called Timescale. They make great products to enhance PostgreSQL databases; I'm also a big fan of their hypertables for time-series data. If we go to the pgvectorscale repository, you can see it complements pgvector: at the core we are still using a PostgreSQL database, we use the pgvector extension to enable the vector capabilities, and then through pgvectorscale we introduce a new index. That is their unique approach, and through this index we can speed up query results even more. If you look into some of their blog posts, for example on how they made PostgreSQL a better vector database, you can see the speed comparisons they ran. For me, at least for the projects we've been working on, there is absolutely no need to use a dedicated vector database. I got some comments on this on that earlier video as well; it's going to depend on your application, and if you scale into the billions of records then a dedicated vector database might make sense at some point, but for most of you watching, a regular PostgreSQL database with this setup is going to be more than enough, and fast enough to cover all the needs of your AI applications. With this setup we have a robust open-source database that can manage both relational data and vector data, and it's super fast. Really, what else do you want from a system like this? That's why we are currently using it. So that is the quick theory behind it; I just wanted to get that out there so you understand the difference between a dedicated vector database and PostgreSQL with pgvector and pgvectorscale, and how Timescale plays a role in this. There are more references in the documentation if you want more background information.

Now we can get into the setup. I'm always trying to find the balance, because I see we're already a couple of minutes into the video, between keeping it practical and hands-on, which is what you come here for, and giving you enough context to understand whether this approach actually makes sense for you. So here are the steps: we're going to set up our Docker environment, connect to the database, run the Python script to insert the vectors, and then perform the similarity search with another Python script.

Let's come over to the code base (this is all in the repository; link in the description, so you can get this as well). We are going to start with the docker-compose file, because that is the first step, and it's what we are going to run using Docker Compose. If you're new to Docker, open up ChatGPT, throw this file in there, and ask: what is this, what is Docker Compose, explain this file to me. What we are essentially going to do is use Docker to run this within an isolated container on our machine, my laptop in this case. Looking at what's in the file: we have one service, which we call timescaledb, and there we load the image from Timescale, so we get TimescaleDB with PostgreSQL version 16. We give the container a name, and we can also set some of the settings for our PostgreSQL database: in this case we specify the database name, which we set to the default postgres, the user, also postgres, and the password, which we set to password. This is all local, for testing purposes; you of course want to change this if you are going to put it somewhere in production. We put it on the default port, and the final important thing to note is that we're also using a volume. A volume is a way for Docker to create a separate place to store your data, and when you stop the container, that volume is persisted, meaning your data stays there. So this is an important part to include.
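For reference, a compose file along the lines of what's described above looks roughly like this. Treat it as a sketch rather than the exact file from the repository; the image tag and volume name here are illustrative:

```yaml
services:
  timescaledb:
    image: timescale/timescaledb-ha:pg16   # TimescaleDB image on PostgreSQL 16
    container_name: timescaledb
    environment:
      POSTGRES_DB: postgres                # default database name
      POSTGRES_USER: postgres              # default user
      POSTGRES_PASSWORD: password          # local testing only; change for production
    ports:
      - "5432:5432"                        # default PostgreSQL port
    volumes:
      - timescaledb_data:/var/lib/postgresql/data  # persists data across restarts

volumes:
  timescaledb_data:
```

The port, user, and password here mirror the defaults mentioned above.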
That is the docker-compose file. The next step is to run it: we can copy the terminal command, come over to the IDE, open up a new terminal, and make sure we cd into the docker folder where the docker-compose file lives. Then we run docker compose up with the -d flag to start it in detached mode. If everything works properly and you have configured everything correctly, this should be up and running and show "running" in green. We can validate this in Docker Desktop, where we see timescaledb running, with the image on the default PostgreSQL port, now on localhost because we're running this locally on our machine.

So now that the database is running on our system, we can connect to it. What you need for that is a GUI client. I am using TablePlus, but you can also use something like DataGrip or pgAdmin, or an extension in VS Code or Cursor; look up your favorite way of doing this. With TablePlus, I already have the connection, but let me show you how to set it up. I create a new connection, select PostgreSQL, and I can leave the host at the default (localhost resolves to this IP), leave the port at the default, set the user to the default postgres, the password to password, and the database also to postgres. I can test this, and it should show green. I connect, and now we have a connection to the database.

That brings us to step number three: inserting the vectors. If we look at the database, we can see there are no tables; it's completely empty. So let's come over to the app folder and open up insert_vectors.py to see what's actually going on. This is a very simple script that we can start up and use to have a look at the data. Here we see an FAQ document, which is loaded from the data folder; this is all in the repository, so you can do this as well. It's a simple demo example: we have a question, an answer, and a category. This is a hypothetical use case for an e-commerce store that's trying to build up an internal knowledge base.
There are questions like "What are the shipping options?", "How can I track my order?", and so on. These are going to be the records that we put into the vector database, and to do that we take this data and vectorize it using an embedding model, converting the string text data into a vector, a numerical representation. For you to follow along and run this, there are two steps first, which you are probably familiar with. First of all, there's the requirements.txt, so make sure you have a Python environment; I have a virtual environment over here with all of these installed. Then there's the .env file; you can see an example .env file in the repository. We need two variables: an OpenAI API key, so plug that in, and the Timescale service URL, which just points to the local instance of your PostgreSQL database. If you didn't change anything in the docker-compose file and you run this via Docker, then the example URL should be correct; if you changed, for example, the username or the password, then update it accordingly. So take that example .env file, create a copy, call it .env, and make sure to fill in that OpenAI API key.
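Concretely, the two variables look something like this. The variable names follow the description in the video but may differ from the repository's exact example file, and the connection URL assumes the default docker-compose credentials from above:

```
OPENAI_API_KEY=sk-...
TIMESCALE_SERVICE_URL=postgres://postgres:password@localhost:5432/postgres
```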
Lastly, before running everything, you might want to have a quick look at the settings.py file in the config folder. You can leave everything on its defaults, but just so you know, this is where you can adjust some of the settings: which OpenAI model we're using to generate the answers and which embedding model we're using. Also make sure the embedding dimension aligns with these settings: 1536 is what we're using for the text-embedding-3-small model, so those should match.

To fill up our database we can use insert_vectors.py, so let's see what we've got going on there. We have the data, which we already looked at. Then we have the prepare data function; there is some documentation on what it does, but essentially it takes the question and answer columns and puts them into one single string. Then we use our vector store class, which I created to help with all of this and which creates the embeddings, and we use the apply method on the pandas DataFrame to apply the function to everything and embed all of the rows. Let's run everything: here we can see that every record in the dataset is sent to OpenAI with an embedding request, and once it's done we have a DataFrame with an id, some metadata, some content, and an embedding. This is the structure we specify within the function: id, metadata, content, embedding. This structure works really well out of the box with pgvectorscale, and more specifically with the timescale-vector Python library that we also leverage behind the scenes, but more on that in a bit. Just know that we're taking a DataFrame and converting it into this structure, where we now also have an embedding column containing the vector.

Now, a quick look behind the scenes at the vector store class that we created, because we use that class over here and call get_embedding. This is also within the project, so it's not an external library we're importing; it's just the file you can see over here, vector_store, and get_embedding is a very simple method where we make a call to OpenAI and use their API to create the embedding. That's it; we also log the time it took, but that's about it, very simple.
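To make the shape of this step concrete, here is a minimal sketch of the idea. The file path, column names, and helper names are illustrative, not the repository's exact code; the OpenAI embeddings call itself is the standard API usage:

```python
import uuid

import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_embedding(text: str) -> list[float]:
    """Call the OpenAI embeddings API for a single string."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding


df = pd.read_csv("data/faq_dataset.csv")  # illustrative path: question, answer, category


def prepare_record(row: pd.Series) -> pd.Series:
    """Merge question and answer into one string, then embed it."""
    content = f"Question: {row['question']}\nAnswer: {row['answer']}"
    return pd.Series({
        "id": str(uuid.uuid1()),  # time-based UUID, which the timescale-vector client expects
        "metadata": {"category": row["category"]},
        "content": content,
        "embedding": get_embedding(content),
    })


records_df = df.apply(prepare_record, axis=1)
```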
So now we have that DataFrame, which we can see on the right over here, and we can use it to upsert into the database we created. There's a list of functions on the vector store that we can leverage. First of all, we can create the tables: let's run vec.create_tables(). This is instant, and if we come back to TablePlus and hit Cmd+R to refresh, we have an embeddings table with that same structure: id, metadata, content, and embedding. If we take a quick look at the create_tables function we're calling on the vector store, it is essentially nothing more than a reference to the create_tables function from the timescale-vector library. You can find more information in the links I've left in the documentation: the official docs and Timescale's blog post on this Python library. Behind the scenes we're leveraging a lot of functionality from that library; all I did was create a simple wrapper in my vector store to also handle our embeddings, because I like to work with pandas DataFrames. With create_tables, create_index, and drop_index, the wrapper is essentially just referencing that library.

A quick note here: because we are using the timescale-vector library and calling create_tables directly on the client, we use all of its default settings, so the table will always have this shape, id, metadata, content, embedding, if you use this functionality. You can also do the whole table creation fully custom, and for that I'd refer you to the official GitHub repository. There you can see that just by running regular SQL, which is what the Python wrapper is doing behind the scenes, you can create all kinds of structures, fully custom depending on your use case, with one column that has the vector data type, and that is where you store your embeddings. So it's really flexible and accommodates various use cases. Before doing so, just make sure to also run CREATE EXTENSION IF NOT EXISTS; this adds the vectorscale extension, which is what allows you to create columns with the vector data type and, later, the index. So just so you know: we're using the Python wrapper, but you can also do this in plain SQL.

The next step is to create the index, and remember from the introduction, this is really where pgvectorscale shines and how it differs from plain pgvector. We can run this line as well and it will create the index for us. Again, if we look inside the wrapper, it's just calling create_embedding_index with the DiskANN index on the client. The official documentation covers the different index options that exist, but I would recommend just starting with the DiskANN index. All of this is performed through the vector client we established with our database service URL, the one we specified in the .env file; that's how the vector store class has a link to the database, and how we create the tables, enable the index, et cetera.
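For reference, the plain-SQL equivalent of these steps looks roughly like this. The table and index names are illustrative, and the exact syntax can differ between pgvectorscale versions:

```sql
-- Enable pgvectorscale (CASCADE also installs the pgvector extension it builds on)
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

-- A fully custom table: any shape you like, plus one vector column
CREATE TABLE embeddings (
    id        UUID PRIMARY KEY,
    metadata  JSONB,
    contents  TEXT,
    embedding VECTOR(1536)   -- must match the embedding model's dimension
);

-- The StreamingDiskANN index that pgvectorscale adds on top of pgvector
CREATE INDEX embeddings_idx ON embeddings
USING diskann (embedding vector_cosine_ops);
```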
And then the last step: with our records DataFrame in memory, with the vector embeddings, we can call upsert, throw in the pandas DataFrame, and upsert it to the database. The output shows "inserted 20 records into embeddings". Let's come over to our database and refresh it, and here you can see all of our metadata, our content, and our embeddings. To show you how this upsert function works: we use the vector client's upsert again, and we just pass the records as a list, because I've added some slight functionality to make it work with a pandas DataFrame; we take the DataFrame and first convert it to records. This comes down to preference. Like I said, I like to work with the Jupyter interactive window over here next to me, and with that it's really convenient to work with pandas DataFrames instead of, for example, dictionaries or JSON files, because it can get messy to keep a good overview of all the data you're dealing with.

And that wraps up step three: we set up the database, connected to it, and have now also inserted the data into it.
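A minimal sketch of that DataFrame-friendly wrapper might look like this; the upsert call on the client is the timescale-vector library's own method, while the surrounding class and the DataFrame conversion are this project's convenience layer (names here are illustrative):

```python
import pandas as pd
from timescale_vector import client


class VectorStore:
    def __init__(self, service_url: str, table_name: str = "embeddings"):
        # Sync client from the timescale-vector library;
        # 1536 matches text-embedding-3-small
        self.vec_client = client.Sync(service_url, table_name, 1536)

    def upsert(self, df: pd.DataFrame) -> None:
        """Convert a DataFrame with id/metadata/content/embedding columns
        into a list of record tuples and hand it to the library's upsert."""
        records = list(
            df[["id", "metadata", "content", "embedding"]]
            .itertuples(index=False, name=None)
        )
        self.vec_client.upsert(records)
```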
Now it's time for step number four: performing the similarity search and really bringing this whole RAG system to life. For this we start with the similarity_search.py file. Let me start this up again. We initialize the same vector store, but now we utilize more of its functionality, and this is another reason why I like to wrap some of this functionality, even when it's very low-level: we're literally just reusing that vector client, but having it all in that same vector store that we can use within our own file, with some documentation added, is just something I like to do within our projects. I'm going to walk you through some examples showing how we can leverage this RAG system, and then I'll show you some of the more advanced capabilities, like metadata filtering, advanced filtering using predicates, and even time-based filtering, to really make this a very powerful and effective RAG solution.

So let's go through this file and start with a simple question. This is going to be a relevant question, meaning it can be answered using the data that we have: "What are your shipping options?" Let's store that and run the search on the vector store.
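As a sketch of what that call might look like through the wrapper: the method and parameter names below are assumptions for illustration (the wrapper embeds the question and delegates to the timescale-vector client's search), not necessarily the repository's exact API:

```python
# `vec` is the initialized VectorStore from the previous step
relevant_question = "What are your shipping options?"

# Plain similarity search: embed the question, then fetch the nearest neighbours
results = vec.search(relevant_question, limit=3)

# Metadata filtering: only consider records whose metadata matches
results = vec.search(
    relevant_question,
    limit=3,
    metadata_filter={"category": "Shipping"},  # hypothetical category value
)
```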