<b>In this video, I will show you an easy solution for preventing duplicate records in your vector stores. To demonstrate this, let's pretend we created a customer support chatbot for a restaurant. Here we have a very simple RAG chatbot that upserts data from my Word document.
Let's have a look at that Word doc. It's a simple Q&A for a restaurant called The Oak and Barrel. It contains some basic information about the restaurant, and we can also see information about the menu.
For instance, we can see the weekly specials over here. Let's assume that this week there is 50% off on all sushi and a happy hour between 4 and 6 pm. So what the restaurant would do is upload their knowledge base.
</b> <b>We would then upsert this data into our vector store. And now we can see that 29 records were added to the vector store. In this demo, we are using a Pinecone vector store.
If we go over to Pinecone, we can see the 29 records over here. Right, so that's all great. And of course, we can now test this in the chat.
Let's ask, "What are the specials?" This tells us that there is 50% off on all sushi and a happy hour between 4 and 6. Great, that's working perfectly.
</b> <b>Now in the real world, these specials would change all the time, maybe on a weekly or daily basis. So what the restaurant would typically do is update their Q&A document. As an example, on their menu, they might change the special from sushi to steaks, like so.
It would make sense to now upload the latest file. So let's save this and click on Upsert.
This tells us that 29 documents were added, just as before. However, if we refresh the vector store, we will now notice that there are 58 records, which means that although we made one small change to the knowledge base, all the documents were duplicated. And this can cause some serious confusion for the chatbot.
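</b>
To make the duplication concrete, here is a minimal Python sketch of an upsert that never compares new data with what is already stored. It is only an illustration under stated assumptions: the `naive_upsert` function, the in-memory dictionary standing in for the vector store, the random IDs, and the chunk texts are all hypothetical and not taken from Flowise or Pinecone.

```python
import uuid


def naive_upsert(store: dict[str, str], chunks: list[str]) -> int:
    """Insert every chunk under a fresh random ID; nothing is compared or replaced."""
    for chunk in chunks:
        store[str(uuid.uuid4())] = chunk
    return len(chunks)


# Hypothetical stand-ins for the 29 chunks of the Q&A document.
chunks_v1 = [f"chunk {i}" for i in range(28)] + ["Special: sushi"]
# Second version: only the chunk about the special has changed.
chunks_v2 = chunks_v1[:-1] + ["Special: steaks"]

vector_store: dict[str, str] = {}
added_first = naive_upsert(vector_store, chunks_v1)
added_second = naive_upsert(vector_store, chunks_v2)

# Both the old and the new special are now in the store.
specials = sorted(text for text in vector_store.values() if text.startswith("Special"))
print(added_first, added_second, len(vector_store))  # 29 29 58
print(specials)
```

Because every chunk gets a new ID, the second upload cannot overwrite the first, so the store ends up holding both the sushi and the steak special.
<b>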
</b> <b>Within these records, there is one record that says the sushi is on special and another that says the steaks are on special. So if we actually test this, let's clear the chat.
Let's ask, "What are the specials?" We can see that this response lists both sushi and steaks as the current specials, which is not correct. Thankfully, Flowise offers a very simple solution for preventing duplicates and keeping the vector store clean and up to date.
This solution is called Record Manager. First, let me show you the benefit of adding Record Manager to the chatflow. I'll then take you through the process of setting up Record Manager step by step.
</b> <b>Let's close this chat and move these nodes over, like so. In this example, we are using Pinecone, but the other vector stores support this as well.
On the vector store node, you can see an input for Record Manager. If we hover over this, it says, "Keep track of the records to prevent duplication." I've already set up this Record Manager node.
We will go through the setup step by step in a few minutes, but I first want to show you the benefit of including this node. Let's attach it to the Pinecone vector store.
</b> <b>Then I'm going to select the original document again. That's the one with the sushi special. I'm going to save this.
Before we upsert this, I'm first going to clean the Pinecone database. I'm going to delete all 58 of these records, just so that we have a clean vector store to start off with. Great.
I've now cleared out the vector store, so we have zero records at the moment. Back in Flowise, I'm going to upsert this document.
This is the one that has sushi as the special. So let's go ahead and click on Upsert.
</b> <b>Because this is our first time uploading this document, 29 records were indeed added, and we can see there are 29 records in the vector store. Now let's upload the updated document.
This is the one where we changed the special from sushi to steaks. Back in Flowise, let's select that document and save this.
Watch what happens when we upsert this document now. We can see that 28 records were skipped, one record was deleted, and one record was added. That is because the record that contained the change was removed from the vector store, and the new information was added.
Also, if we refresh the vector store, we can see that only 29 records exist, so no duplicates. And if we ask the chat, "What are the specials?", it will tell us that the steaks are on special. That is the latest change we uploaded.
</b> <b>Right, so let's have a look at how we can add Record Manager to our chatflows. I'm not going to build this entire chatbot from scratch. If you would like to learn how to build RAG chatbots, then check out my other video over here.
What I am going to do is delete this text splitter. I'm going to delete this document node. And let's also remove this Record Manager node.
Now we are left with this conversational retrieval chain, the ChatOpenAI node, the embeddings node, and the Pinecone database. Awesome. I'm also going to clear the records in my Pinecone database.
Great, we now have a clean vector store to work with. So let's have a look at adding the Record Manager node to our canvas. First go to Add Nodes, then go to the Record Manager folder.
</b> <b>At the time of recording, three different databases are supported: MySQL, Postgres and SQLite. Feel free to use any of these. For this demo, I am going to use a Postgres database, and you will be able to follow along for free.
Let's add this Postgres Record Manager node and connect it to the Pinecone node. For this Postgres Record Manager node, we now need to connect a Postgres database.
We can create a free Postgres database using Supabase. Go over to supabase.com, then sign in or create your account.
</b> <b>After signing in, you should see your dashboard. Click on New Project and give your project a name. I'll call mine Flowise Tutorial. Then create a database password and make sure to store that password somewhere, as we will need it later on in this video. Select your region and click on Create New Project.
This will take a minute or two to set up. Once the project is up and running, go to Project Settings, then click on Database and look for the section called Connection Parameters. Back in Flowise, under Connect Credentials, click on the drop-down and click on Create New. Give your credentials a name. I'll call mine Postgres Record Manager Tutorial. Then we need to provide a username and password.
Paste in the database password that you stored earlier, and under User, simply copy the user from Supabase, paste it into this field and click on Add. By the way, if you forgot your password, you can simply go to this Database Password section and reset your database password. Let's also copy this host name and paste it into the Record Manager node.
Let's also get the database name by copying this value and pasting it into this field. Let's copy the port and add it to this field as well. And that is actually all we need to do to connect Flowise to the Postgres record manager.
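</b>
As a recap of the values collected above, the sketch below shows how host, port, database name, user and password fit together in a standard Postgres connection URL. Flowise takes these as separate fields, so this is only a way to see them side by side. The `PostgresSettings` class and every value in it are hypothetical placeholders, not real Supabase details.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass
class PostgresSettings:
    """The five connection values gathered from the database settings page."""
    host: str
    port: int
    database: str
    user: str
    password: str

    def url(self) -> str:
        # Percent-encode the credentials so characters such as "@" or ":"
        # inside a password do not break the URL structure.
        user = quote(self.user, safe="")
        password = quote(self.password, safe="")
        return f"postgresql://{user}:{password}@{self.host}:{self.port}/{self.database}"


# Placeholder values only; substitute the ones shown in your own project.
settings = PostgresSettings(
    host="db.example.invalid",
    port=5432,
    database="postgres",
    user="postgres",
    password="p@ss:word",
)
print(settings.url())
# postgresql://postgres:p%40ss%3Aword@db.example.invalid:5432/postgres
```
<b>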
</b> <b>Now when we click on Additional Parameters, we can set the table name. This is the name of the table that will be created in the Postgres database, and we can simply leave this as the default value. We can also leave the namespace. These next two fields are very important, and I will explain them in a second.
Under Cleanup mode, we have three different options: None, Incremental and Full. We can also specify the Source ID Key value, but for most use cases, we can simply leave this as source. This is the key in the document's metadata that will be used to compare the different documents.
Let's leave this as source. Let's close this pop-up and have a look at how these actually work. Back in Supabase, let's go to Database, and you will notice that there are no tables at the moment.
</b> <b>As a reminder, our Pinecone vector store is also empty. How this actually works is that when we upsert data, the data is chunked as usual and stored in the vector database, but a hashed value of the data is stored in the Postgres database as well.
It is the Postgres database that tells this process whether the records already exist or whether the data has changed, and the vector store will only be changed if the record manager allows the process to continue. Now let's have a look at a very simple example. Let's go to Add Nodes, and under Document Loaders, let's add the Plain Text node.
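</b>
The hashing idea described a moment ago can be sketched in a few lines of Python. This is a simplified illustration, not Flowise's actual implementation: the `RecordManager` class is a hypothetical in-memory stand-in for the Postgres table, a plain dictionary stands in for the vector store, and the example texts and file name are made up.

```python
import hashlib


def content_hash(text: str, source: str) -> str:
    """Fingerprint a chunk by its text and its source identifier."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()


class RecordManager:
    """Hypothetical stand-in for the Postgres table of stored hashes."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def exists(self, key: str) -> bool:
        return key in self.seen

    def record(self, key: str) -> None:
        self.seen.add(key)


def upsert(vector_store: dict[str, str], manager: RecordManager,
           docs: list[tuple[str, str]]) -> dict[str, int]:
    """Write a chunk to the vector store only if its hash is not yet recorded."""
    result = {"added": 0, "skipped": 0}
    for text, source in docs:
        key = content_hash(text, source)
        if manager.exists(key):
            result["skipped"] += 1  # unchanged chunk: leave the vector store alone
            continue
        vector_store[key] = text
        manager.record(key)
        result["added"] += 1
    return result


store: dict[str, str] = {}
manager = RecordManager()
docs = [("We open at noon.", "faq.docx"), ("Happy hour is 4 to 6 pm.", "faq.docx")]

first = upsert(store, manager, docs)
second = upsert(store, manager, docs)
print(first)   # {'added': 2, 'skipped': 0}
print(second)  # {'added': 0, 'skipped': 2}
```

Running the same upsert twice leaves the store untouched on the second pass, because every hash is already on record.
<b>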
</b> <b>Of course, in your projects, you will probably want to use a PDF document or a Word document, similar to what we did at the beginning of this video. But to explain how Record Manager works, let's stick with a simple Plain Text node. Let's attach this to the document input on the Pinecone node and provide some text, like "dog".
What we also want to do when using Record Manager is specify a source metadata key. As a reminder, when we go to Additional Parameters in the Record Manager, we can see the Source ID Key value over here. So Record Manager is going to look for a metadata key called source to compare the different documents.
Let's create this value in the metadata of our document. In this node, let's click on Additional Parameters, and under Metadata, let's click on Add. Let's give this a key name, which is source.
Now we can specify pretty much any value. This would typically be something like the file name or some unique identifier. To keep this simple, let's call this dog and click on this tick box to submit the value.
</b> <b>By the way, if you're unable to view and change these metadata values, you might want to try a different browser. I've had issues changing these using Chrome, but it seems to work just fine using Edge. Let's close this pop-up and add another document loader to this project.
This is a small pro tip, as a lot of you have been asking me in the comments whether it's possible to use more than one document loader in a workflow. And yes, you can. Let's add another document loader.
Let's add the Plain Text loader, like so. We can simply attach this to the same document input on the Pinecone node. So when we perform the upsert, both of these document loaders will be executed.
For this one, let's set the text to cat, and under Additional Parameters, let's add a new metadata value called source and set it to cat, like so. Right, so we now have two document loaders with unique document sources.
</b> <b>Let's save this, and in the Record Manager, let's go to Additional Parameters and first have a look at the cleanup mode of None. Let's save this chatflow and run the upsert.
Let's click on Upsert, and you will notice that two records were added to our vector store, one for dog and one for cat. Both of these have source values. This one is dog and this one is cat.
In the vector store, we can see that two documents were indeed inserted. And if we go to our Postgres database, we can see that this table was created and we have two entries at the moment. We can view these entries by clicking on these three dots and then View Table. These values won't make a whole lot of sense to us.
I just wanted to show you what these entries look like in the database. Now watch what happens if I execute this upsert again. Usually this would result in duplicate values, but let's have a look at what happens now.
</b> <b>This time, two records were skipped. That is because the record manager determined that no changes were made.
Unfortunately, the None mode will not perform any cleanup either. So if I change this to dog two, save this and click on Upsert, this might say that one record was added and one was skipped.
But in reality, nothing was actually changed. If I refresh the vector store, we can see that the text is still only dog and not dog two. So the None mode is not very helpful for recording changes.
</b> <b>Now let's have a look at the second mode, and that is Incremental. To demonstrate this, I'm going to delete the records from the Postgres database.
I'm also going to clear our vector store. Great. What the Incremental mode does is record any changes that we make.
As an example, if we change the value from dog to dog two, that change will be recorded, but it will not delete records whose source is no longer being passed in. In other words, if we delete cat, the cat record will remain in the vector store. Only changes will be recorded. Let's test this out.
First, I'm going to upsert these two documents. Great, so two records were added.
</b> <b>Let's change dog to dog two, and let's delete the cat node completely. Let's save this and run the upsert again.
This time we can see that one record was added and one was deleted. And if we go back to Pinecone and refresh this, we can see that dog was indeed changed to dog two. So this is recording new changes.
However, it did not delete cat. So that is what Incremental will do for us. It will store any new changes, but it will not delete any source documents that were not part of this execution.
Just to be clear on what this pop-up showed: we added a new record containing the text dog two, and the previous record that contained only dog was deleted. It was not the cat record that was deleted. Great.
</b> <b>Now let's have a look at the final example, and that is the Full cleanup mode. We can use the Full cleanup mode to delete any documents that are not part of this execution.
Just to make sure that this is clearly explained, I'm going to clear out the vector store again and delete the records in the Postgres database. Let's revert this back to dog. I'm also going to add back our Plain Text node, like so.
Let's attach this to our Pinecone node as well. Let's add the text cat, go to Additional Parameters and add the source as cat, just like we had before. Let's save this and run the upsert, so two records were added.
</b> <b>But now let's change the text from dog to dog two, and let's delete the cat node, just like we did with the Incremental example. What Full is going to do is record this change for dog.
And because we are no longer passing the cat source document, that document should be deleted. Let's save this, run the upsert, and see what happens. We can now see that one record was added.
That is the new record for dog two. Two documents were deleted: the original dog record, as well as the record related to cat. So let's refresh Pinecone, and now we can see that we only have the dog two record, and cat was deleted.
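</b>
The difference between the Incremental and Full results shown in the demo can be reproduced with a small Python simulation. This is a sketch of the behaviour seen on screen, not Flowise's code: the `upsert` and `run` functions are hypothetical, one dictionary plays the role of both the vector store and the record table, and the None mode is deliberately left out.

```python
import hashlib


def doc_key(text: str, source: str) -> str:
    """Fingerprint a document by its text and its source metadata value."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()


def upsert(store: dict[str, dict], docs: list[dict], cleanup: str) -> dict[str, int]:
    """Add new documents, skip unchanged ones, then clean up stale records."""
    if cleanup not in ("incremental", "full"):
        raise ValueError("this sketch only models 'incremental' and 'full'")
    stats = {"added": 0, "skipped": 0, "deleted": 0}
    current_keys = set()
    for doc in docs:
        key = doc_key(doc["text"], doc["source"])
        current_keys.add(key)
        if key in store:
            stats["skipped"] += 1
        else:
            store[key] = dict(doc)
            stats["added"] += 1
    if cleanup == "incremental":
        # Only sources passed in this run are cleaned up.
        sources = {doc["source"] for doc in docs}
        stale = [k for k, rec in store.items()
                 if rec["source"] in sources and k not in current_keys]
    else:
        # Full: anything not passed in this run is removed.
        stale = [k for k in store if k not in current_keys]
    for key in stale:
        del store[key]
        stats["deleted"] += 1
    return stats


def run(cleanup: str) -> tuple[dict[str, int], list[str]]:
    """First upsert dog and cat, then upsert only the changed dog document."""
    store: dict[str, dict] = {}
    upsert(store, [{"text": "dog", "source": "dog"},
                   {"text": "cat", "source": "cat"}], cleanup)
    stats = upsert(store, [{"text": "dog two", "source": "dog"}], cleanup)
    return stats, sorted(rec["text"] for rec in store.values())


print(run("incremental"))  # ({'added': 1, 'skipped': 0, 'deleted': 1}, ['cat', 'dog two'])
print(run("full"))         # ({'added': 1, 'skipped': 0, 'deleted': 2}, ['dog two'])
```

In both modes the old dog record is replaced, but only Full also removes cat, because cat's source was not part of the second run.
<b>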
</b> <b>If you like this video, then please hit the like button to support my channel, and subscribe to my channel as well for more content on Flowise. You might also be interested in this other video, where I show you 8 hidden features in Flowise.