In this video, I will show you an easy solution for preventing duplicate records in your vector stores. To demonstrate this, let's pretend we created a customer support chatbot for a restaurant. So here we have a very simple rack chatbot that upsets data from my Word document.
And let's have a look at that Word doc. It's a simple Q&A for a restaurant called The Oak and Barrel. So this contains some very simple information about the restaurant, and we can also see information about the menu.
For instance, we can see the weekly specials over year. So let's assume that for this week there is 50% off on all sushi and a happy hour between 4 and 6 pm. So what the restaurant would do is upload their knowledge base.
We would then "absert" this data into our vector store. And now we can see that 29 records were added to the vector store. And in this demo, we are using a pinecone vector store.
So if we go over to pinecone, we can see the 29 records over year. Right, so that's all great. And of course, we can now test this in the chat.
So let's ask what are the specials? And this is telling us that there is 50% off on all sushi and happy hour between 4 and 6. Great, that's working perfectly.
Now in the real world, these specials would change all the time, maybe on a weekly or daily basis. So what the restaurant would typically do is update their Q&A document. As an example, on their menu, they might change the special from sushi to steaks, like so.
So it would make sense to now upload the latest file. So let's save this. Let's click on "absert".
And this will tell us that 29 documents were added just as before. However, if we refresh the vector store, we will now notice that there are 58 records, which means that although we made one small change to the knowledge base, all the documents were duplicated. And this can cause some serious confusion for the chatbot.
As within these records, there's a record that says that the sushi is on special. And there's also a record that says the steaks are on special. So if we actually test this, let's clear the chat.
Let's ask, what are the specials? We can see that this response is actually providing both sushi and steaks as the current specials, which is not correct. Thankfully, Flow-wise offers a very simple solution for preventing duplicates and keeping the vector store clean and up to date.
And this solution is called Record Manager. First, let me show you the benefit of adding Record Manager to the chat flow. And I'll then take you through the process of setting up Record Manager step by step.
Let's close this chat. And let's move these nodes over, like so. In this example, we are using pinecone, but the other vector store support this as well.
On the vector store node, you can see an input for Record Manager. And if we hover over this, it says, keep track of the records to prevent duplication. So I've already set up this Record Manager node.
And of course, we will go through the step by step in a few minutes. But I first want to show you the benefit of including this node. Let's attach it to the pinecone vector store.
Then I'm going to select the original document again. So that's the one with the sushi special. I'm going to save this.
And before we up-serve this, I'm first going to clean the pinecone database. So I'm first going to delete all 58 of these records, just so that we have a clean vector store to start off with. Great.
So I've now cleared out the vector store. So we have zero vector records at the moment. So back in Flow-wise, I'm going to up-serve this document.
And this is this one that's got the sushi as the special. So let's go ahead and click on up-serve. Let's up-serve this.
So because this is our first time uploading this document, 29 records were indeed added. And we can see there's 29 records in the vector store. So let's upload this document now.
This is the one where we changed the special from sushi to steaks. So back in Flow-wise, let's select that document. Let's save this.
And watch what happens when we up-serve this document now. We can now see that 28 records were skipped, one record was deleted, and one document was added. And that is because the document that contained the change was removed from the vector store, and the new information was added.
Also, if we refresh the vector store, we can see that only 29 records exist, so no duplicates. And if we ask the chat, what are the specials, it will tell us that the steaks are on special. And that is the latest change which we uploaded.
Right, so let's have a look at how we can add record manager to our chat flows. I'm not going to build this entire chatbot from scratch. So if you would like to learn how to build rag chatbots, then check out my other video over here.
So what I am going to do is delete this text split there. I'm going to delete this document node. And let's also remove this record manager node.
So now we are left with this conversation retrieval chain, the chat open AI node, the embeddings node, and the pinecone database. Awesome. I'm also going to clear the records in my pinecone database.
Great, we now have a clean vector store to work with. So let's have a look at adding the record manager node to our canvas. First go to add nodes, then go to the record manager folder.
At the time of recording, there are three different databases that are supported my SQL postgres and SQLite. Feel free to use any of these. But for this demo, I am going to use a Postgres database and you will be able to follow along for free.
So let's add this Postgres record manager node. And let's connect this to the spinecone node for this post press record manager node. We now need to connect a Postgres database.
We can create a free Postgres database using Superbase. So go over to superbase. com, then sign in and create your account.
After signing in, you should see your dashboard. Click on new project, give your project a name, I'll call mine flow wise tutorial, then also create a database password and make sure to store that password somewhere as we will need that password later on in this video. Select your region and click on create new project.
This will take a minute or two to set up your new project. Once the project is up and running, go to project settings, then click on database and look for this section called connection parameters back in flow wise under connect credentials, click on this drop down and click on create new give your credentials a name. I'll call mine Postgres record manager tutorial, then we need to provide a username and password.
So paste in the password that you copied from earlier and under user simply copy the user from Superbase and let's paste it into this field and let's click on add. And by the way, if you forgot your password, you can simply go to this database password section and reset your database password. Let's also copy this host name and let's paste that into the record manager node.
Let's also get the database name by copying this value and let's paste it into this field. Let's also copy the port and let's add it to this field as well. And that is actually all we need to do to connect flow wise to the Postgres record manager.
Now when we click on additional parameters, we can set the table name and this is the table name that will be created in the Postgres database, we can simply leave this as the default value. We can also leave the namespace. These two fields are very important and I will explain them in a second.
Under cleanup mode, we have three different options, none, incremental and full. We can also specify the source ID key value, but for most use cases, we can simply leave this as source. This is simply the value in the documents metadata that will be used to compare these different documents.
Let's simply leave this as source. Let's close this pop up and let's now have a look at how these actually work. Back in Superbase, let's actually go to database and you will notice that there are no tables at the moment.
And as a reminder, our pinecone vector store is also initial. So how this actually works is when we absurd data, the data will be chunked as per usual and stored in the vector database. But a hashed value of the data will be stored in the Postgres database as well.
And it is the Postgres database that will tell this process whether the records already exist or if there were any changes to the data and the vector store will only be changed if the record manager allows the process to continue. Now let's have a look at a very simple example. Let's go to add nodes and under document loaders, let's add the plain text node.
Of course, in your projects, you will probably want to use a PDF document or a Word document similar to what we did in the beginning of this video. But I think in order to explain how record manager works, it simply stick with a simple plain text node. Let's attach this to the document input on the pinecone node and let's provide some text like dog.
Now what we also want to do when using record manager is to specify a source metadata key value. As a reminder in this record manager, when we go to additional parameters, we can see the source ID value over here. So record manager is going to look for a metadata key value called source to compare these different documents.
So let's create this value in the metadata of our document. In this node, let's click on additional parameters and on the metadata, let's click on add. Let's give this a key name, which is source.
And now we can specify pretty much any value. This would typically be something like the file name or some unique identifier. To keep this simple, let's simply call this dog and let's pick on this tick box to submit this value.
By the way, if you're unable to view and change these metadata values, you might want to try a different browser. I've had issues changing these using Chrome, but it seems to work just fine using Edge. Let's close this pop-up and let's actually add another document loader to this project.
And by the way, this is a small pro tip, as a lot of you have been asking me in the comments if it's possible to use more than one document loader in a workflow. And yes, you can. Let's add another document loader.
Let's add the plain text loader like so. And we can simply attach this to the same document input on the Pineco node. So when we perform the upset, both of these document loaders will be executed.
For this one, let's call it cat and under the additional parameters, let's add a new metadata value called source and let's set this source to cat like so. Right. So we now have two document loaders with unique document sources.
Let's save this and in the record manager, let's go to additional parameters and let's first have a look at the cleanup method of none. So let's save this chat flow. Let's run the upset.
Let's click on upset and you will notice that two records were added to our vector store, one for dog and one for cat. And both of these have source values. This one is dog and this one is cat.
And in the vector store, we can see that two documents were indeed inserted. And if we go to our Postgres database, we can see that this table was created and we have two entries at the moment, we can view these entries by clicking on these three dots and view table. And these values won't make a whole lot of sense to us.
But I just wanted to show you what these entries look like in the database. Now watch what happens if I execute this upset again, usually this would result in duplicate values. But let's have a look at what's going to happen now.
This time, two records were skipped. And that is because the record manager determined that no changes were made. And therefore, these two records were skipped.
Now, unfortunately, this method of none will not perform any cleanups either. So if I change this to dog two, let's save this, let's click on upset. And this might say that one record was added and one was skipped.
But in reality, nothing was actually changed. If I refresh the vector store, we can see that the text is still only dog and not dog two. So the method of none is not very helpful for recording any changes.
Now let's have a look at the second method. And that is the incremental method. And I think in order to demonstrate this, I'm actually going to delete the records from the Postgres database.
And I'm also going to clear our vector store. Great. So what the incremental method will do is it will record any changes that we make.
So as an example, if we had to change the value from dog to dog to that change will be recorded, but it will not delete any records. So in other words, if we had to delete cat, the cat record will remain in the vector store only changes will be recorded. Let's test this out.
So first, I'm going to upset these two documents. Great. So two records were added.
It's changed dog to dog two. And let's delete the cat node completely. It's saved this, let's run upset again.
And this time we can see that one record was added and one was deleted. And if we go back to pinecone, and let's refresh this, we can see that dog wasn't the change to dog two. So this is recording new changes.
However, it did not delete cat. So that is what incremental will do for us. It will only store any new changes, but it will not delete any source documents that were not part of this execution.
So just to be clear on what this pop up showed is we added a new record containing the text dog to add the previous record that contained only dog was deleted. It was not the cat record that was deleted. Great.
Now let's have a look at the final example. And that is the full cleanup method. So we can use the full cleanup method to delete any documents that are not part of this execution.
So just to make sure that this is clearly explained, I'm going to clear out the vector store again, I'm also going to delete the records in the postgres database. Let's revert this back to dog. And I'm also going to add back our plain text node like so.
Let's also add this to our pinecone database. It's added text cat, let's also get to additional parameters. Let's add the source as cat, just like we had before, let's save this, let's run the upsell, so two records were added.
But now let's change the text from dog to dog two. And let's delete the cat node, just like we did with the incremental example. But what fool is going to do is it's going to record this change for dog.
And because we are no longer passing the cat source document, that document should be deleted. Let's save this, let's run the upsell, and let's see what happens. We can now see that one record was added.
And that is the new record for dog two, and two documents were deleted, that would be the original dog record, as well as the record related to cat. So let's refresh pinecone. And now we can see that we only have the dog to record and cat was deleted.
If you like this video, then please hit the like button to support my channel, and subscribe to my channel as well for more content on Flowwise. And you might also be interested in this other video where I show you 8 hidden features in Flowwise.