Extracting Structured Data From PDFs | Full Python AI project for beginners (ft Docker)

Thu Vu data analytics
Download Docker Desktop 👉 https://dockr.ly/4e7k8tQ Containerize your generative AI application 👉 h...
Video Transcript:
Every year in mid-September, the Ig Nobel Prizes are awarded: researchers receive recognition for their unusual studies. In the spirit of the Ig Nobel Prizes and the current AI hype, I decided to create an absurd project of my own: building an app that uses a large language model to extract key information about Ig Nobel Prize research papers, including titles, summaries, authors, and years of publication.

Now you might be thinking: do we really need to use AI for that? Well, not really. You could, of course, read the papers and extract the information yourself. However, this type of smart application can be useful for many purposes and use cases. Imagine having a magical system that automatically extracts all the necessary information from unstructured data (think PDF documents, books, business invoices, customer queries, and even images) and then neatly organizes that information in a table structure you define. Consider how many hours it could save you, if it works. So let's find out.

In this video, we are going to interact with a large language model like GPT-4o in Python using its API. Then we'll build a document retrieval system to answer our questions based on a given research paper in PDF. Then we'll take this to a whole new level by getting answers not only as text but also in a structured format that we define; this is now made possible with high accuracy thanks to a new feature of the OpenAI API called structured outputs, which we'll dive into later. We'll also make our app cite the exact text sources used to generate the answer, which makes the application more trustworthy and reliable. Then we'll make the app more user-friendly by wrapping it in a nice-looking Streamlit interface, and finally we'll containerize and deploy our amazing app using Docker. All this may sound a bit complicated, but don't worry: I'll walk you through every step of the project and discuss high-level concepts, so even if you're not familiar with coding in Python and the technical aspects, you can still get a good idea of how things work. This video is sponsored by Docker, an open platform for developing, shipping, and running applications; more on them later.

As AI becomes more mature, businesses realize that one of the most important use cases for AI is information retrieval: extracting structured information from unstructured data in a custom knowledge base. You can think of this process as sifting through many reports or documents, reading them, understanding them, finding the right pieces of information, and organizing them into a structured format like an Excel table. Let's admit it: most of us would try to avoid this kind of task. It's tedious and can be extremely time-consuming, yet many office jobs involve these types of tasks to some extent. The good news is that this process can now be automated with great accuracy by leveraging advanced language models like ChatGPT and Claude.

This kind of system is called retrieval-augmented generation, or RAG for short. If ChatGPT allows us to ask "What's Mark Zuckerberg's salary?", a RAG system allows us to ask "What's Mark Zuckerberg's salary according to the given documents?". According to the 2023 Retool report, an impressive 36% of enterprise LLM use cases involved RAG technology. RAG applications are great when the knowledge base contains proprietary, sensitive, or highly specialized information that out-of-the-box LLMs like ChatGPT or Claude wouldn't know about. In such cases, those AI models would likely hallucinate, or make up information.
Luckily, a RAG system helps avoid hallucination by limiting the context and allowing the system to refuse to answer when it lacks sufficient context. It can also cite the data sources the answer is based on, making the system more trustworthy. This is a clear advantage of a RAG system compared to fine-tuning a large language model on a custom knowledge base, because a fine-tuned model still operates like a black box, and fine-tuning is also often a much bigger technical hurdle.

So how does a RAG system work, exactly? A RAG system consists of three main steps. The first step is to process the documents. The second step is, based on the user's question, to query for the relevant parts of the documents that likely contain the answer. The final step is to craft the response based on those relevant documents using an LLM of choice, for example an OpenAI model, Claude, or even a local LLM. If you look closely, the first two steps comprise the retrieval part, which looks up the relevant information from the context, and the last step is the augmented generation part, where we utilize the reasoning capability of large language models to generate highly accurate and relevant responses based on the retrieved information.

All right, now let me walk you through the code to see how we can implement this system. Without further ado, let's get started. Let's create a new project folder; I'll call it rag-llms. I'll also create a virtual environment inside this folder to encapsulate the dependencies of the project, put it in a subdirectory called my_env, and then activate it (the commands are sketched below). Now let's open this project in VS Code (you can also use JupyterLab or Jupyter Notebook if you prefer) and create a new Jupyter notebook, which I'll call data_extraction.

We first install all the necessary Python packages and modules. Firstly, we need langchain, a flexible framework for building applications with large language models; langchain-community, which contains essential third-party integrations for LangChain; langchain-openai, the package that connects LangChain with OpenAI; chromadb, an open-source vector database (we'll talk about that later); pypdf, a package for reading and parsing PDF documents in Python; pandas for data wrangling; streamlit for building quick web apps in Python; and python-dotenv, a package for setting environment variables in Python. Let's run this to install all these packages; note that if you haven't selected your Python environment yet, you need to switch to the current virtual environment here. After the installation is done, we can import these packages and a bunch of modules. Don't worry about what they are used for yet; I'll explain them when we actually use them.

In this project we'll be using OpenAI models, but this is just an example: in practice we could also use APIs from Anthropic, Mistral, or others, and if you want to try different models you could even use an open-source local LLM like Llama 3. These open-source models are free to use, but since we can often only run a small version of them locally, you might need to compromise a bit on output quality. So if you want to see the best performance on your task, I recommend trying out the most advanced models available.
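Putting the setup steps above into concrete commands, this is roughly what they look like on macOS/Linux (the folder and environment names follow the walkthrough):

```bash
# Create the project folder and a virtual environment inside it
mkdir rag-llms && cd rag-llms
python -m venv my_env
source my_env/bin/activate   # on Windows: my_env\Scripts\activate

# Install the packages listed above
pip install langchain langchain-community langchain-openai chromadb pypdf pandas streamlit python-dotenv
```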
Since OpenAI's API is a paid service, you need some credit in your account to be able to use it. If you go to platform.openai.com, you should be able to sign in with your ChatGPT account. Once you've signed in, under Settings, in the Billing tab, you can see your credit balance. Note that if your credit balance reaches $0, your API requests will stop working; if that's the case, you might want to top up the credit balance: click the button and add something like $5, which is the minimum and more than enough for this toy project. This is my credit balance; I've actually only used a few cents so far, so it's pretty inexpensive. Once that's done, we should be good to go.

The next step is to create an OpenAI API key. If we click on the default project button over here and go to the organization overview, you'll see all your projects listed. By default you have a Default project in your account; I have another project here called "YouTube video" that I created earlier. To create a new project, click the Create button, give the project a name (for example, "rag project") and a description ("data extraction from PDF documents"). Somehow we also need to add a business website; I'm just going to add my YouTube channel here, but if you don't have one, you can use another link as well. Let's click Create. After we have our project, we can open the API Keys menu and create a new secret key: click the button, name the key (I'll just say "rag key"), assign it to the rag project, and create the secret key. Make sure to copy this key, and go back to our project.

To protect our API key, we'll create a hidden file inside our project folder called .env, and in this file we add our OpenAI API key. Let's save this file and go back to our notebook. After loading all the different modules and packages, if we run the load_dotenv function that we imported from the dotenv module, it will read all the key-value pairs from the .env file we just created and set them as environment variables. Now if we run os.environ.get with this variable name, we get back exactly the API key that we set in the .env file, as sketched below.
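A minimal sketch of the key-loading step (assuming the variable in .env is named OPENAI_API_KEY):

```python
import os
from dotenv import load_dotenv

# Read the key-value pairs from .env and set them as environment variables
load_dotenv()

# Retrieve the key we stored in .env
openai_api_key = os.environ.get("OPENAI_API_KEY")
```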
Now let me save this variable as openai_api_key for reference later. We define our LLM using ChatOpenAI, the integration with OpenAI models that we imported from the langchain-openai package, and specify the model we want to use. In my case I'm using gpt-4o-mini, which is a small, lightweight, and very cheap model, well suited for smaller tasks. Optionally, you can also pass in the API key, the openai_api_key variable we just defined. This is optional because, by default, if the key isn't passed in, the ChatOpenAI class will read it from the OPENAI_API_KEY environment variable; since we already set that environment variable we don't need to specify it, but for clarity I add it here. Once we've defined our LLM, we can start using it by calling the invoke method. For example, I'll pass in the query "Tell me a joke about cats", and if we run this cell we get back a response: "Why did the cat sit on the computer? Because it wanted to keep an eye on the mouse." Great, our API key is working. (The model setup and the PDF loading step are sketched below.)

Let's now dive into the first part of our RAG project: processing the PDF document. I'll create a new folder in our project called data, and inside this folder I'm going to add a few sample PDF documents. The first paper is about problems with using long words needlessly, and the other paper is an estimation of the total saliva volume produced per day in five-year-old children; these are some of the studies that actually won Ig Nobel Prizes in the past. I'll add these files to my data folder. Let's first load the PDF document. We're going to use the PyPDFLoader class from LangChain, specifying the path to our PDF file and the name of the file, which is "Oppenheimer 2006" and so on, and then load in the text page by page. If we run this, we can see that the pages variable contains a list of Document objects; each Document object represents a page of the PDF, with metadata containing the source and the page number, plus the page content.

The next problem is that a single document can be really, really long; we cannot just pass that giant chunk of text as the context for the LLM to answer questions, and this is for a few different reasons. Firstly, there's a token limit on each API request you make: for the GPT-4o model, you have up to 128,000 tokens, which is the total number of tokens shared between your prompt and the chat completion. That's a lot, roughly 100,000 words or a 200-page book, but there's a more important reason why we want to split the document into smaller chunks. For the document Q&A task we are doing, the answer to a user's question often lies in some specific parts of the document, so we want to feed only those most relevant parts as context into the LLM to draw the answer from. This makes the process more efficient and even more accurate: research has shown that passing too much information, and irrelevant information, to an LLM can actually confuse it and make it more likely to give a wrong answer. At this point we have the document split into different pages (three pages here), but this might still be too big for our purposes, so we want to split the text into even smaller chunks, such as paragraphs or sentences.
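A minimal sketch of the model setup and test query described above:

```python
from langchain_openai import ChatOpenAI

# gpt-4o-mini: small, lightweight, and cheap -- well suited for smaller tasks
llm = ChatOpenAI(model="gpt-4o-mini", api_key=openai_api_key)  # api_key is optional if the env variable is set

response = llm.invoke("Tell me a joke about cats")
print(response.content)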
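And the PDF loading step might look like this (the file name is a hypothetical stand-in for the paper mentioned above):

```python
from langchain_community.document_loaders import PyPDFLoader

# Hypothetical file name -- point this at one of the PDFs in the data folder
loader = PyPDFLoader("data/oppenheimer_2006.pdf")
pages = loader.load()  # a list of Document objects, one per page, with page_content and metadata
```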
The idea is that as we split the document into several chunks, each chunk becomes more focused and more relevant when we query the documents. To achieve this, we can use a RecursiveCharacterTextSplitter, which is available from LangChain. Here we can set the chunk size, that is, how many characters we want a chunk to contain (in this case I set it to 1,500), and the overlap between chunks (I set this to 200). We can also specify the length function, that is, how we want to count the characters, and the separators for the chunks: since I don't want to split chunks in the middle of a word, I specify the separators as either a page-break or line-break character or a space. The RecursiveCharacterTextSplitter will then do its best to split our documents into smaller chunks given the parameters and constraints we specified. Now we run the text splitter on our documents, that is, the pages from our PDF. If we run this, we get back a list of chunks; in my case, something like 10 of them, and you can see that these chunks are smaller than the pages we had before. Here's a rough visualization of how my chunks map onto the document: each chunk roughly corresponds to one paragraph of the research paper. You can definitely play with these parameters to create smaller or larger chunks depending on your case. It's good to note that if the chunks are too big, they might contain redundant and irrelevant information, but if they're too small, they might not contain enough context for the LLM to generate a good answer. Let me save these chunks as the chunks variable; a sketch of this step follows below.
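A sketch of the splitting step with the parameters mentioned above (the exact separator list is an assumption):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,                  # max characters per chunk
    chunk_overlap=200,                # characters shared between consecutive chunks
    length_function=len,              # count plain characters
    separators=["\n\n", "\n", " "],   # split on page/line breaks or spaces, never mid-word
)
chunks = text_splitter.split_documents(pages)
print(len(chunks))  # around 10 chunks for this paper
```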
Now, here's an important part: to query information from these chunks, we need a way to represent the text content numerically. If you already know what text embeddings are, feel free to skip this section; otherwise, here's a very quick explanation. Text embeddings are a way of representing words or documents as numerical vectors that capture their meaning; this way, text data can be converted into a format that computers can understand and work with. These embedding vectors are literally lists of numbers, and actually huge lists of numbers. You can think of them as a sort of coordinates in a multi-dimensional space. The specific values in a vector don't have any inherent meaning on their own; it's the relationships between vectors that matter. Similar pieces of text will have vectors that are closer to each other in this multi-dimensional space, and pieces of text that are not closely related in meaning will be further apart. How do we know if vectors are close to or far from each other? The distance between vectors can be calculated using cosine similarity or Euclidean distance. We don't actually need to calculate that ourselves, though; there are plenty of existing functions and packages that can do it for us.

There are many different embedding models available, ranging from more naive ones to more complex ones. A good embedding model better captures the meaning of the text, so having good embeddings of our text is crucial for our retrieval system to work well. Here we are going to use an embedding model from OpenAI called text-embedding-ada-002, and we also pass in our OpenAI API key. If we use this embedding function to turn the word "cat" into a vector and look at the resulting test vector, we see that it is a huge list of 1,536 numbers. We can even see for ourselves how similar or dissimilar two pieces of text are using their embeddings; again, we can use LangChain to calculate the distance between two pieces of text. We load the evaluator, define it as embedding distance, and pass in the embedding function we have. If I calculate the distance between "Amsterdam" and "coffee shop" (I think they're quite related, since Amsterdam is really known for coffee shops), we get a score, which is the distance between these two strings in terms of their embeddings. If we instead compare "Paris" and "coffee shop", we get a bigger distance, meaning those two strings are less similar than "Amsterdam" and "coffee shop". So that's a quick introduction to how embeddings work and how we can use them to evaluate whether two pieces of text are related.

Now, since we have to create embeddings for a lot of different chunks of text, we need a way to create, manage, and query these embedding vectors in a smart way, and this is where a vector database comes in. Think of a vector database like a library: in a book library we have books organized on shelves, and we can find them by looking up their titles and authors; a vector database is similar, but instead of books we store chunks of information represented as vectors. In this project we'll be using the Chroma database, an open-source vector database. It's very fast and simple to use, but it's not the only option; plenty of other vector databases can do pretty much the same thing. So how do these vector databases query information, exactly? When we make a query, for example asking "What is the conclusion of this paper?", the database creates an embedding vector for the question too, then scans through all the embedding vectors stored in the database to find the ones that are most similar based on a distance metric we define. It then gives back the corresponding chunks in the database that are most related, or relevant, to the question. These relevant chunks can later be put together and fed into an LLM, for example GPT-4o, to generate a good answer to the question we asked.

So let's now create a Chroma database to store all our vector embeddings. To do this, we can use the Chroma.from_documents function, which takes our chunks and uses the OpenAI embeddings function to generate the vector embeddings for each chunk. I'm also going to specify a path called vector_store and set it as the persist directory, so that when we create the database, I have a folder on my disk called vector_store that I can use to load the vector database later. The database should save automatically after we create it, but you can also force a save using the persist method. If we run this, you can see the vector_store folder created in our project directory. Let me also wrap this whole thing inside a function called create_vector_store that takes our text chunks, the embedding function, and the path to the vector store as arguments.

There's a little complication here, though: if you happen to create vector embeddings for the same document twice, Chroma will save the two sets of embeddings as different chunks, so our database would contain duplicated documents, or duplicated text chunks. There are many ways to solve this issue, which is out of scope for today, but here I've implemented a quick filter that keeps only documents with unique content and adds just those to the database.
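A minimal sketch of the embedding function and the distance comparison described above:

```python
from langchain_openai import OpenAIEmbeddings
from langchain.evaluation import load_evaluator

embedding_function = OpenAIEmbeddings(model="text-embedding-ada-002", api_key=openai_api_key)

# Turn a word into a vector; for ada-002 that vector has 1,536 dimensions
test_vector = embedding_function.embed_query("cat")
print(len(test_vector))  # 1536

# Compare two strings by embedding distance (a lower score means more similar)
evaluator = load_evaluator("embedding_distance", embeddings=embedding_function)
print(evaluator.evaluate_strings(prediction="Amsterdam", reference="coffee shop"))
print(evaluator.evaluate_strings(prediction="Paris", reference="coffee shop"))
```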
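And a sketch of the vector store creation (the unique-content filter mentioned above is omitted here for brevity):

```python
from langchain_community.vectorstores import Chroma

def create_vector_store(chunks, embedding_function, vector_store_path):
    """Embed the chunks and store them in a persistent Chroma database on disk."""
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_function,
        persist_directory=vector_store_path,  # the folder that appears in the project directory
    )
    return vector_store

vector_store = create_vector_store(chunks, embedding_function, "vector_store")
```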
It's just a small thing to be aware of, and I won't go into detail in this video. Here I just use a new vector store path so it doesn't get mixed up with the previous one I created. Also be aware that we're using a paid API from OpenAI to create the embeddings, so embedding a very large document can be costly; so far it's been pretty okay for me, as small documents of a few pages cost almost nothing to a few cents.

All right, now that we have created the vector database representing the information chunks in our document, we can go ahead and query it. We first load the database (I'll call it vector_store again) and then use it to instantiate a retriever via the as_retriever method from LangChain. Here we can specify the search type; a couple of different search types are available, and I'm using "similarity", which uses cosine distance to determine similarity. By default this function returns the four most relevant chunks. Now if we call the invoke method on this retriever, for example asking "What is the title of the article?", we get back the most relevant chunks.

We can feed this into an OpenAI model to create a high-quality response using that data as our source. First we need a prompt template to create the prompt with. You can use something like this: I specify a kind of system prompt ("You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know; don't make anything up."). Notice there's a placeholder for the context we're going to pass in, which will be the chunks of information we got from the retriever; the second input is the question, the user query. We then create the actual prompt by passing the context text and the question into this prompt template. If we print out the prompt, we get the entire prompt with all the chunks of information, together with the query we asked. Now if we pass this prompt to our LLM, the gpt-4o-mini model, we get back a response with the answer: the title of the article is such-and-such. This is the correct answer, and that's great.

We can certainly do all these steps ourselves, but with LangChain there's another way: LangChain has something called the LangChain Expression Language (LCEL), a way to chain all the steps and functions together. All the querying steps we just did are equivalent to creating a chain that looks like this: for the context, the retriever output goes through a formatting step that is basically the concatenation of all the information chunks; the user question is the other input; the context and the question are passed into the prompt template to create the actual prompt; and that prompt is passed to the LLM to actually generate the answer. If we pass the question to this chain, it gives us exactly the same output. Both steps are sketched below.
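A sketch of loading the database and querying the retriever:

```python
from langchain_community.vectorstores import Chroma

# Load the saved database and turn it into a retriever
vector_store = Chroma(persist_directory="vector_store", embedding_function=embedding_function)
retriever = vector_store.as_retriever(search_type="similarity")  # returns the 4 most relevant chunks by default

relevant_chunks = retriever.invoke("What is the title of the article?")
```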
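And a sketch of the prompt template and the equivalent LCEL chain (the exact prompt wording is reconstructed from the description above):

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

PROMPT_TEMPLATE = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, say that you don't know; don't make anything up.

{context}

Question: {question}
"""
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

def format_docs(docs):
    # Concatenate the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | llm
)
print(rag_chain.invoke("What is the title of the article?").content)
```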
And here's the fun part: suppose we don't want the answer back as just a string, but in a certain structure, so that we can easily pass it into a table. We can achieve that by defining the desired output structure using a Pydantic class (Pydantic is a data validation library for Python). Here I'm creating a class called ExtractedInfo that specifies the structure I want the output to have: the paper title, the summary, the publication year, and the paper authors. We can specify the desired data type for each answer; for example, the paper title gets the string type and the publication year gets the integer type. Then all we need to do, in the chain we created earlier, is wrap the LLM with with_structured_output and pass in the Pydantic class, that is, the ExtractedInfo class we just created. We also set the strict parameter to True, which makes sure the output structure is additionally validated by the OpenAI API's structured outputs feature. Now if we invoke this chain asking for the information about the research article, including the title, summary, publication year, and authors, it gives us output in the structure we defined.

What's even cooler is that we can also specify the structure of each of these items. For example, if I want not only the answer but also the sources and the reasoning behind the answer, I can create another class, AnswerWithSources, where I say I want the answer; the sources, which is basically the full, direct text chunk from the context that was used to answer the question; and the reasoning, which explains the reasoning of the answer based on the sources. If I try this, I get back a nested structure containing the answer, the sources, and the reasoning for each of the items I specified earlier. Now all we need to do is put this structured response into a table format using pandas, and here we go. A sketch of these classes and the structured chain follows below.
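A sketch of the nested structured-output setup, reusing the retriever, prompt template, format_docs, and LLM from the earlier sketches (the field names are assumptions based on the description):

```python
from pydantic import BaseModel, Field
from langchain_core.runnables import RunnablePassthrough
import pandas as pd

class AnswerWithSources(BaseModel):
    answer: str
    sources: str = Field(description="Full direct text chunk from the context used to answer the question")
    reasoning: str = Field(description="Explain the reasoning of the answer based on the sources")

class ExtractedInfoWithSources(BaseModel):
    paper_title: AnswerWithSources
    paper_summary: AnswerWithSources
    publication_year: AnswerWithSources
    paper_authors: AnswerWithSources

structured_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | llm.with_structured_output(ExtractedInfoWithSources, strict=True)
)
result = structured_chain.invoke(
    "What are the title, summary, publication year, and authors of the research article?"
)

# One column per field; rows hold the answer, sources, and reasoning
df = pd.DataFrame(result.model_dump())  # use .dict() on Pydantic v1
```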
Now that we have everything in our app working, we can clean up the code and create a nice Streamlit app. In case you're not familiar with it, Streamlit is a quick and easy framework for creating small Python apps; it's beyond the scope of this video to explain Streamlit in detail, so I'm going to speed up this section. Let me create an app folder inside our project folder and put all the code for the Streamlit app there, together with all the functions our app needs; a bare-bones skeleton is sketched below. Let me also create a requirements.txt file that lists all the packages necessary to properly run our app. To run the app locally, we do streamlit run followed by the name of the Python file.
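The details of the app are beyond this video's scope, but a bare-bones Streamlit skeleton might look like this (the file name and layout are assumptions):

```python
# app/streamlit_app.py -- hypothetical skeleton; the real app wires in the RAG functions built above
import streamlit as st

st.title("Extract Structured Data From PDFs")

uploaded_file = st.file_uploader("Upload a research paper", type="pdf")
if uploaded_file is not None:
    st.write("Process the PDF, build the vector store, and display the extracted table here.")

# Run locally with: streamlit run app/streamlit_app.py
```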
Now you might be thinking: how can we share our app with other people? In this section, we'll talk about how to deploy our Streamlit app. There are many different ways to deploy and share your app with the world. You can choose to deploy the Streamlit app directly on Streamlit Community Cloud, which requires making all the code public; I don't particularly want to do that for this app, so I'll be deploying it using Docker. Docker is a tool that can package and deploy your application in a container. When you share this container with your colleagues, they don't need to worry about installing all the parts and dependencies to run your app: Docker makes sure everything your app needs is right there, ready to go. It also benefits you: once you dockerize your app, you can run it anywhere, on any operating system, without needing to install all the necessary dependencies on your computer.

You might be wondering what the difference is between Docker and Python virtual environments. The main difference is that with a Python virtual environment you can switch between Python versions and dependencies, but you're stuck with your own operating system. A Docker container, on the other hand, encapsulates an entire operating system, providing a small, isolated machine inside your system. So when you use Docker, you can swap out the entire operating system and install and run your app on any system, including macOS, Ubuntu, Windows, and so on.

Let me break down the key components of Docker in simple terms. First, we have the Dockerfile: think of this as the blueprint, or the recipe, for your app. It's essentially a text file with step-by-step instructions on how to set up your app and prepare everything it needs to run, including things like the operating system, the Python version, and any libraries your app requires. When you build your Dockerfile, Docker takes those instructions and creates a Docker image: a snapshot of your app and its environment, created by following the recipe in your Dockerfile. The image is like a portable, pre-cooked meal, ready to be served. Finally, when you run a Docker image, you create a Docker container, which is like taking that pre-cooked meal and actually serving it: a live, up-and-running instance of your app, isolated from other processes on your computer.

Now let me walk you through how to deploy our Streamlit app with Docker. Firstly, we install the latest version of Docker Desktop, which provides the Docker Engine and the command-line interface; be sure to choose the right installer for your computer (in my case, macOS). I'd also recommend installing the Docker extension for VS Code, which will make our lives much easier. Now that we understand the basics and have everything we need, we can create a Dockerfile, and you'll see that Visual Studio Code automatically recognizes this file as a Dockerfile.

Inside this file, we start with a Python base image. If you search for "Docker Python", you can see a bunch of different Python versions available; in my case, I prefer Python 3.11, so in the Dockerfile I write FROM python:3.11. Next, we set up a directory in our container to hold our app: WORKDIR /app creates and switches to a directory called /app in the container. Now it's time to add our Streamlit app to the container: COPY . . copies all the files in our current directory into the app directory in the container (the first dot is our current directory, the second is the app directory in the container). Next up, we install the requirements: RUN pip3 install -r requirements.txt installs all the Python packages our app needs. In the next step we expose the port: Streamlit typically runs on port 8501, so we EXPOSE 8501, which basically tells Docker that our container will use this port to serve the app in the browser. Optionally, we can add a HEALTHCHECK command to tell Docker how to test the container and check that it's still working. And finally, we add the command to start our Streamlit app: streamlit run followed by the name of the Python file for our app.
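Putting those instructions together, the Dockerfile might look like this (the app file name and the health-check details are assumptions, following Streamlit's standard Docker setup):

```dockerfile
# Start from a Python base image
FROM python:3.11

# Create and switch to the /app directory in the container
WORKDIR /app

# Copy everything from the current directory into /app
COPY . .

# Install the Python packages the app needs
RUN pip3 install -r requirements.txt

# Streamlit runs on port 8501 by default
EXPOSE 8501

# Optional: let Docker check whether the app is still responding
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

# Start the Streamlit app when the container runs
CMD ["streamlit", "run", "streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```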