How to Get Your Data Ready for AI Agents (Docs, PDFs, Websites)

Dave Ebbelaar
Video Transcript:
One of the first things you want to do when building AI agents is give them access to your own data: documents, PDFs, websites, anything that gives your agent specific knowledge about your company or the problem you are trying to solve. There are a lot of tools online that can help with this, but most come at a cost, mainly that they are closed source: you get an API key, send your data to their platform where they do the parsing, and get the data back. But there are also open-source alternatives that work just as well. In this video I want to show you how to build a fully open-source document extraction pipeline in Python using a library called Docling.

I will walk you through this GitHub repository (link in the description, of course) and dive into some code examples that show how to parse PDF documents and websites, and eventually make them available in a chat application where we can browse our vector database, search for relevant context, and answer questions about it, retrieving sources as citations. Throughout this video we will cover fundamental techniques like extraction, parsing, chunking, embedding, and retrieval, to show how you can create a knowledge system for your AI agents end to end. I am going to show specific examples with a specific set of tools, vector databases, and AI models, but all of these concepts apply to all kinds of situations: it doesn't matter which vector database, embedding model, or AI model you use; this will really form the foundation of a knowledge system for your application.

All right, let's get started. The README file in the repository, which I have in front of me, has more information if you want to dive a little deeper, but here is what you need to follow along.
You need to create an environment and install the requirements.txt that is available in the project, and you need an OpenAI API key. The document extraction part is fully open source, but I am still going to use OpenAI to create embeddings and to chat with the data. This is optional; you can just as well use an open-source model for that.

To go through this entire example, which I will do in this video, we are going to execute five files: first we extract document content, then we perform chunking, then we create embeddings and put them into a vector database, then we test the search functionality, and finally we bring it all together in the chat application that I demoed in the beginning.

Let's dive in, starting with the extraction. I have a Python file here that I am going to boot up and run in an interactive session, and a PDF of the technical report for the library we are using, which is called Docling. It is a project from IBM, a really great company building great tools, and they are making it fully open source. Based on my experiments, and on what I have heard from other AI engineers working with great companies, this is by far the open-source document extraction library of choice right now. It is already amazing as is, and there is a roadmap of items coming soon to make the library even better.

It is really straightforward to get started, and this simple file shows exactly how. We begin with the DocumentConverter that we import from the library, available after a simple pip install docling (which is in the requirements). Once that is up and running, we can convert a PDF and run it.
This takes some time on the first run because Docling downloads some local extraction models. In the background it goes through the PDF, analyzes all the blocks and components, and performs OCR to produce a data model that we will use in the next steps. Once it is finished, we can call the document attribute and look at what is inside: we now have a DoclingDocument object. We took the whole PDF, ran it through the system, and got this object back, and why that is so awesome comes back to the diagram in their documentation: you can throw all kinds of data files at this library, whether PDF, PowerPoint, DOCX, or websites, and it turns each one into the same data model they created, the DoclingDocument. That lets us unify all kinds of data and build a pipeline or system where it doesn't really matter whether you throw a PDF or a web page at it; we can work with it either way.

Once we have that document, we can do various things with it. For example, we can export to Markdown or export to JSON and look at what that gives us. For a better visual view, we can print the Markdown output and browse through the document: here you can see the Docling technical report, and the library did a very good job of extracting all of the information. So far, plenty of other libraries can do this as well, but where Docling really excels is table extraction: a lot of open-source Python libraries that parse PDFs struggle with tables.
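The conversion step described above can be sketched as follows. This is a minimal sketch, assuming the docling package is installed; the source URL points at the Docling technical report on arXiv and is just an example input.

```python
# Sketch of the extraction step, assuming `pip install docling`.
# SOURCE is an example input; any local path or URL to a supported file works.
SOURCE = "https://arxiv.org/pdf/2408.09869"  # the Docling technical report

def extract_markdown(source: str) -> str:
    """Convert a document (PDF, DOCX, HTML, ...) and export it as Markdown."""
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)  # first run downloads local models
    return result.document.export_to_markdown()

# usage: print(extract_markdown(SOURCE)[:500])
```

The same converter accepts local paths and URLs alike; `result.document` is the unified DoclingDocument object that the rest of the pipeline works with.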
For example, if I scroll down to this table, you can see we get a perfectly formatted Markdown table: everything looks super clean, no weird characters, and all the headings are in correct Markdown format. Overall, out of the box, a really great result.

That was a very simple example of parsing a PDF. Let me clear this up and continue with HTML extraction, because within that same converter we can call the convert method and, instead of a PDF, throw a website at it. If I take this page from the Docling docs and run the conversion, it is really fast because it essentially just parses the HTML. We can get the document and then the Markdown, and now we have an exact replica of the web page, with all of the HTML correctly parsed into a Markdown object we can use.

That was just one page. What if we want an entire website, with all the pages on it? For that we can use a trick that leverages the sitemap.xml that most websites have. For any given website, you can test this in the browser by appending sitemap.xml to the URL: it is an XML file listing the pages and URLs of that website. I created a simple helper function called get_sitemap_urls that tries to fetch that sitemap.xml and returns all of the URLs it finds. If we plug in that same URL, instead of getting the information from a single page we first get the sitemap URLs, and you can see all of the URLs found in the sitemap. As you can imagine, we can now loop over this and extract all of the pages one by one, and Docling has a nice method for that too: instead of calling convert, we call convert_all, which returns an iterator we can loop over. So I create a new variable with an empty list called docs, and for each result in the iterator I take the
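The sitemap trick can be sketched with the standard library alone. get_sitemap_urls is a helper from the repository; this is a hypothetical minimal version of it, split so the XML parsing is a pure function.

```python
import urllib.request
import xml.etree.ElementTree as ET

def parse_sitemap(xml_text: str) -> list[str]:
    """Extract all <loc> URLs from a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    # Sitemaps use an XML namespace, so match any tag that ends in 'loc'.
    return [el.text.strip() for el in root.iter()
            if el.tag.endswith("loc") and el.text]

def get_sitemap_urls(base_url: str) -> list[str]:
    """Fetch <base_url>/sitemap.xml and return the URLs listed in it."""
    url = base_url.rstrip("/") + "/sitemap.xml"
    with urllib.request.urlopen(url) as resp:
        return parse_sitemap(resp.read().decode("utf-8"))
```

With the URL list in hand, `converter.convert_all(urls)` yields one conversion result per page, exactly as described above.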
result.document attribute and append it to the docs list. Run that, and we now see DoclingDocument objects for all of these pages. This is already quite cool, right? We can throw PDFs, a single web page, or an entire website at it and get the same data object back. These were just the basic examples; the Docling library also lets you specify custom extraction parameters for situations where your data is a little trickier. But this really forms the foundation and first step of your knowledge extraction system: being able to throw different kinds of documents at it, websites, PDFs, DOCX, whatever, and get them into a structured format we can use in the next step.

And then, real quick: if you want to learn how we help developers like you beyond these videos, check out the links in the description. We have a lot of resources available, from learning Python for AI in a free community and free course, all the way to the production framework we use to build and deploy generative AI applications for our clients. And if you are already at the level where you are considering freelancing, maybe taking on side projects, we can also help you land that first client, so make sure to check out the links.

All right, so now we know how to extract data. The next step is chunking, and the Docling library can also help us with this. Chunking means that instead of taking the entire document and putting it into a single record in our database, we split it up into what we call chunks. We do this to create logical splits, components that fit well together, so that when we query our AI system we don't get the entire document or the entire book back, just the specific parts relevant to our question. That is not as simple as splitting the text every X words or characters.
Luckily, out of the box, Docling can also help with chunking, through two different methods that you can combine in the hybrid chunker. First there is the hierarchical chunker, which looks at the document and splits it up based on logical components or groups that fit together well, such as lists or paragraphs, creating groups with children that we can already use. That is a great starting point for chunking, and it is performed automatically. We can take it one step further with the hybrid chunker, which can split chunks that are too large for your embedding model. Remember the pipeline: we get the data, extract it, create the chunks, and in a next step send them to an embedding model to create the embeddings that we store in the vector database. Every embedding model has a specific maximum input: for example, the OpenAI text-embedding-3-small, text-embedding-3-large, and ada models have a maximum input of 8,191 tokens, the total amount of input you can send to the model. You want to keep your chunks below that number for the embedding model you are using. The hybrid chunker first splits chunks that are too large for the embedding model; it also stitches together chunks that are too small, such as a chunk that is just one header or a short paragraph. And it works with your specific tokenizer, so the chunks will perfectly fit the chosen model you are working with. Out of the box, that is just a great way to work. In the code example here we use OpenAI with the text-embedding-3-large model, setting max tokens to the number from the documentation.
To do this, I created a simple OpenAI tokenizer wrapper, because the Docling documentation uses an open-source model available via Hugging Face; my wrapper follows the exact API specification the chunker needs. Let's fire this up. I am going to get the same PDF again and, for the sake of simplicity, run the conversion one more time in this file; it takes a couple of seconds, but it is the same action as in the previous step.

All right, we have our result again, our parsed PDF, and now we apply the hybrid chunker. We import HybridChunker from the Docling library, set the tokenizer to the OpenAI tokenizer wrapper I created, set max tokens to the model's maximum input tokens, and set merge_peers to True, which is the default option; it allows smaller chunks to be merged together, and it is optional. This is all syntax specific to the Docling documentation; I just followed their example. Once I put this into memory and run it, I have taken the entire document through the hybrid chunker and now have a list of 36 chunks: the entire PDF condensed to 36 chunks, where we know that every one of these text blobs will fit into the context of the embedding model we are using. That is amazing, right? That already covers a lot of steps that normally take a lot of work to do well.

We now have chunks that we can send to an embedding model to get the vectors, which we can store in a vector database. In the next example I am going to use LanceDB.
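The chunking step can be sketched like this. A minimal sketch, assuming docling is installed; the `tokenizer` argument stands in for the small OpenAI tokenizer wrapper from the repository, and MAX_TOKENS matches the OpenAI embedding limit mentioned above.

```python
# Sketch of the chunking step, assuming `pip install docling`.
MAX_TOKENS = 8191  # max input tokens for OpenAI text-embedding-3-large

def chunk_document(doc, tokenizer):
    """Split a DoclingDocument into embedding-model-sized chunks."""
    from docling.chunking import HybridChunker

    chunker = HybridChunker(
        tokenizer=tokenizer,    # must expose the tokenizer API docling expects
        max_tokens=MAX_TOKENS,  # stay below the embedding model's input limit
        merge_peers=True,       # stitch together chunks that are too small
    )
    return list(chunker.chunk(dl_doc=doc))
```

Pass in the document from the extraction step; the returned list is what goes into the embedding step next.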
The specific vector database implementation you use doesn't really matter; typically in our projects I use PostgreSQL with the pgvector extension, but LanceDB is really easy to work with because the database lives in persistent storage just like a SQLite database: the file literally shows up in your file system, which is just easier to work with. They also have a really nice API. Just so you know, I am not going to dive really deep into how LanceDB works; if you want to know more, look into the documentation on their website.

All right, now moving to the third file, the embedding script. The beginning is the same as what we already did, with extra steps on top to work with the vector database, so let me run all of that code one more time to get the chunks back. I am creating the database here, which will live in the data folder. Next I specify a function, and this is really specific to LanceDB (reference their docs to learn more): what is nice about their API is that we can specify an embedding model as a function, in this case OpenAI's text-embedding-3-large. Then, in the next step, we use a Pydantic model that inherits from LanceModel to specify what our table should look like: we use Pydantic to define the structure of our vector database, and we use the embedding function we created to specify which source field should be sent to the embedding model and which vector column should be vectorized. By doing it like this, within this clean API, we don't have to bother with manually sending and retrieving embeddings; everything is managed from within the table. That is just a nice thing about LanceDB. Let's look at the data model we are using here.
This is the main schema we are using: from all of the documents I extracted, I want a text field, the vector field we will use to perform the search, and a metadata field where we put the file name, the page numbers the chunk was on, and the title. Next to the text, that gives us some important metadata from the documents that we can use later.

To see where this comes from, we can dive a little deeper into the chunks. In the last step we had the 36 chunks, remember; let's take the first object and do a model dump to see the exact content. We can see the text, and there is a meta key with a lot of information; we are just going to extract the file name and the page numbers, and look at the headers that are potentially available in there. That is the chunk model we get back, and we simply define Pydantic models around it in order to work with it.

With that in memory, here is how we create a table in our LanceDB database: we take the db we initiated, call create_table, name it "docling", pass the schema (the chunk schema above), and set the mode to "overwrite", meaning that if the table is already there we simply replace it. I forgot to put the db in memory, so let me make sure we run everything; now we should be good to go. All right, perfect: we have a fresh new table in our project.

Next we can process the chunks. Here is a simple piece of code that loops over all 36 chunks and gets them ready for our table. Let me run it and show you the result, because that is probably easier.
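The table definition can be sketched with LanceDB's embedding-function registry, which is what makes the automatic embedding possible. A sketch, assuming lancedb is installed and an OPENAI_API_KEY is set; field names mirror the schema described above, and the nested-model fields are kept in alphabetical order because of the quirk mentioned in the next step.

```python
# Sketch of the LanceDB table setup, assuming `pip install lancedb` and
# an OPENAI_API_KEY in the environment.
TABLE_NAME = "docling"

def build_table(db_uri: str = "data/lancedb"):
    """Create a LanceDB table whose embeddings are computed automatically."""
    import lancedb
    from lancedb.embeddings import get_registry
    from lancedb.pydantic import LanceModel, Vector

    db = lancedb.connect(db_uri)
    # Registered embedding function: LanceDB calls OpenAI for us on insert.
    func = get_registry().get("openai").create(name="text-embedding-3-large")

    class ChunkMetadata(LanceModel):
        # Fields deliberately in alphabetical order (nested models can
        # otherwise trigger weird errors, as noted in the walkthrough).
        filename: str | None
        page_numbers: list[int] | None
        title: str | None

    class Chunks(LanceModel):
        text: str = func.SourceField()  # this column is sent to the model
        vector: Vector(func.ndims()) = func.VectorField()  # filled for us
        metadata: ChunkMetadata

    return db.create_table(TABLE_NAME, schema=Chunks, mode="overwrite")
```

Because the embedding function is attached at the table level, a later `table.add(records)` embeds the text column without any manual OpenAI calls.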
For each chunk it extracts the text, sets the metadata to the file name, page numbers, and title, and skips everything else; on the right you can see the result, which exactly matches our chunk model, so we can now send the data to our table. One quick note if you are using this approach with Pydantic: if you have a sub-model in your document, you must order its fields alphabetically, otherwise you will get some weird errors. This is probably still a bug in the code; I ran into it, and it took me about an hour to figure out what was wrong.

With the chunks out of the way, we take our table (remember, this is the LanceDB table we created) and add the chunks, sending them to our database. The cool thing about this add function is that it performs the embeddings in the background as well; that is really nice about the LanceDB API, and it can save you a lot of work. All of this is pretty straightforward to implement yourself, but because the embedding function is stored at the table level, we can just send the chunks and not worry about the embeddings. We can now look at the table and see our text, our vector, and our metadata; the to_pandas method returns the first 10 results, and we can also count the rows to check that we have exactly 36 records in total.

All right, so now we have the parsed data, with the embeddings, in a vector database. Again, I showed you how to do this with LanceDB, but you can just as well follow the same principles with any vector database out there. If you want to use PostgreSQL, for example, you can watch the other videos on my channel and just swap out the logic that creates the embeddings and puts them into the table. Our data is now ready for our AI system or agent to use. Let's first look at a very simple example of how we can use that information.
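The per-chunk processing can be sketched as a small pure function. This is a hypothetical version, assuming each dumped Docling chunk carries its text plus a meta dict with an origin filename, doc_items with provenance page numbers, and headings, as the model dump described above suggests.

```python
def to_record(chunk: dict) -> dict:
    """Map a dumped chunk (e.g. chunk.model_dump()) onto the table schema."""
    meta = chunk.get("meta", {})
    # Collect the page numbers of every doc item that makes up this chunk.
    pages = sorted({
        prov["page_no"]
        for item in meta.get("doc_items", [])
        for prov in item.get("prov", [])
        if "page_no" in prov
    })
    headings = meta.get("headings") or []
    return {
        "text": chunk["text"],
        "metadata": {
            "filename": meta.get("origin", {}).get("filename"),
            "page_numbers": pages or None,
            "title": headings[0] if headings else None,
        },
    }
```

Looping this over the chunks and calling `table.add(...)` on the resulting list is all that is left; LanceDB embeds the text column automatically.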
Then, in the last step, we bring this together in the application. Within the fourth file, called search, I fire up the interactive session again, connect to the vector database by simply specifying the local path to it, and load the docling table we created. Through the LanceDB API I can use the simple search method with a query, which is essentially the user question: what query, question, or word do we want to search the vector database for? I set the query type to vector, which performs a similarity search using the embeddings (LanceDB also supports keyword search and a hybrid method), and I set the limit to five, which caps the number of results I get back from the table. I run this and check out the results, which I can again convert to a pandas DataFrame: you can see the text, the vector, the metadata, and the distance, and we can look at the specific text chunks retrieved when we search for "pdf". If I set the limit to three, it returns three; and I can also put in, for example, "what's docling" and get different results based on that query.

All right, we have now covered the parsing, the chunking, the embedding, and the retrieval: all the key components we need to create a knowledge system for our AI system or agent. Now let's see what it looks like when we put it together in an application we can interact with. What I have in front of me, in the fifth file called chat, is a simple setup for a Streamlit application; you can use it by pip installing the streamlit library. I am not going to dive into specifically how to set this up in this video (check out the documentation). We are going to use Streamlit's chat elements.
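The search step can be sketched like this, assuming lancedb is installed and the table from the previous step exists at data/lancedb.

```python
# Sketch of the retrieval step, assuming `pip install lancedb` and the
# "docling" table created earlier at data/lancedb.
def search(query: str, db_uri: str = "data/lancedb", limit: int = 5):
    """Vector-similarity search over the 'docling' table."""
    import lancedb

    db = lancedb.connect(db_uri)
    table = db.open_table("docling")
    # query_type="vector" embeds the query and ranks rows by distance;
    # LanceDB also supports "fts" (keyword) and "hybrid" query types.
    return table.search(query, query_type="vector").limit(limit).to_pandas()
```

The returned DataFrame carries the text, vector, metadata, and distance columns shown in the walkthrough.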
They make it really easy to create a simple interactive chat application that we can spin up locally in pure Python, which is great for demos and examples. What we are doing in this file, at a high level and without getting into all the details: we create a connection with the database, we have a function to search the database and get the relevant context, and we have some specific Streamlit components to handle the chat messages and stream the answer to the user. You can walk through the code to boot this up, but within this specific file, make sure you are in the Docling project folder, the folder where the chat file lives.