Hey, so I think the true value of AI, and specifically large language models, comes from the ability to use your personal or organizational data, right? I mean, that's why we've seen techniques such as RAG, or retrieval-augmented generation, become so popular recently. And these models are great at working with text and markdown.
But a lot of useful information is trapped in formats like PDFs, or proprietary formats like DOCX, which have a complex structure: there are nested elements, there's no standardized layout, and they have different formatting and table structures. So, needless to say, it's not easy to work with these types of documents in an AI workflow.
But that's where Docling comes in. It's an open source project from IBM Research, and it can parse popular document formats and export them into markdown and JSON, with context-aware techniques to preserve the original document's integrity. So what we're going to do today is check out the project.
We're going to convert some PDFs into pure markdown and integrate it into our AI workflow with LlamaIndex for a question-answering application on our own unique data. So let's get started. Welcome to the Docling repository here on GitHub, where we can learn more about the toolkit.
I'll also point you to the documentation, examples, and integrations pages here on the repository. But let's learn a little bit more about the project: for example, the ability to read different types of formats and convert them into markdown and JSON to be embedded, for example, into RAG frameworks, but also to scan PDFs and convert those, and to use not only a CLI but also this library in our Python applications.
So there's a lot of capabilities here, and it stems from this research paper that we'll take a look at. What I want to note is that Docling essentially comes from the need to have a way to parse these different types of document formats.
And it's always been difficult due to the variability in formats and weak standardization. But now with LLMs, we want to be able to pull this information and use it in our AI applications. And typically this has been tough because of licensing, or because of paying for large language model inferencing to run and parse these types of data.
But now we have this open source tool with Docling, which is state of the art. And we can take a look at how this works in this architecture diagram here, where let's say we're parsing a PDF with images. What happens is we're using a little bit of OCR, optical character recognition, magic to do some layout analysis.
So let's say there's a table that spans from page one to page two. We want to be able to preserve the integrity of that table structure, for example, and then assemble that document into this standardized Docling format, which we can use in order to export into JSON or markdown, or also use in a chunking setup with a vector database to be able to run question answering on our documents. So not only PDF, but different formats like DOCX, PowerPoint, and Excel.
And this is how this pipeline works in order to parse these types of documents. So this is all done with Docling. Now, in addition to the architecture and a little bit about the models it uses to do this type of recognition, what I also want to show you is the performance here.
So what the Docling team did is actually take a couple of different open source projects and run them on two different types of systems. So we have an x86 architecture with L4 GPUs, but also a MacBook Pro with ARM architecture. And what they did is essentially process a large number of pages on Docling, Marker, MinerU, and Unstructured.
These are popular open source projects for doing this type of document parsing. And what they saw when they did this comparison and benchmarking on the same exact systems, where lower is better, is that Docling was leading with 3.1 seconds per page on x86 and 1.2 seconds per page on an M3 Max, followed closely by MinerU. But it's important to note that MinerU wasn't able to finish any runs on the M3 Max's ARM architecture.
After MinerU was Unstructured, and then finally Marker, which took about 16 seconds per page on an x86 CPU. So it's important to understand how these different tools parse these types of documents. And there's a lot more information that you should definitely take a look at in the Docling technical report.
But let's go ahead and install the Docling CLI and start converting some documents. Okay. So let's go ahead and install Docling.
So I'm going to copy this command and just hop into the terminal. It's a quick pip install, and as you can see, I've already installed the dependencies and the library here on my system.
But I also want to show you some of the more detailed information for the CLI. So you've got all of these different options in order to format the documents from a folder, for example, or from the web, to export them somewhere, and to use OCR, where you have different choices like EasyOCR or Tesseract.
Or maybe not use OCR at all. And so let's go back to the usage page. And here's the example of a simple command that we're going to run.
I'll hop back here, and let's run the docling conversion and also specify the output location, let's say our desktop. Yeah. So we'll run this command.
And as we talked about before in the research paper, it's going to take a few seconds per page to convert it. But what is actually on this page? So this is a PDF.
And it's an academic paper, just like the one we saw earlier. But you'll notice that there are different types of headings and subheadings, and this will all be formatted into markdown.
And I'm curious to see how it's going to look, right? Because we have so much information here on this page that traditionally it's very hard for an AI model to parse this, associate all the relationships between the text, and be aware of the context in that sense.
So I also want to take a look here at how it's going to format the images. And what's also really interesting is going to be how it takes this table here and actually formats that as well. So let's go ahead.
Head back over to the terminal. Looks like it's finished doing that conversion.
And on our desktop we've got this 1.5 MB markdown file. I'll make this a little bit bigger, and we can see it's already started to format the headers.
Some of the subtext that we just saw on that page in the PDF. And it looks pretty nice. So, all of this has been nicely formatted, including the links.
We've got examples of the figures being converted into base64, so the images are there, and I'll go ahead and scroll past that because that's a lot of text, plus conversions of some of the font keywords. Let's see.
I'm also curious. Oh, there's that table. So it looks like it's done a pretty good job here, actually, of converting it into an AI-readable format.
Right. But let's say that we want to integrate this into our application, where we can actually use Docling's DocumentConverter, for example, in this notebook that we've created. So I'll go ahead and just run this, and the source that we're using here is just the PDF of the technical report that we saw earlier.
We're using the converter and then converting this straight into markdown, which is now going to be this full Docling technical report that we saw earlier. And just like that, it's simple to use either on the command line or in our applications as well. But let's dive a little bit deeper, okay?
So now I want to show you a really fun part: using Docling for reading and parsing those documents, together with LlamaIndex, to be able to retrieve and generate answers that are really efficient and useful based on that unique data we're providing. So let's go ahead and get started. First off, we're installing all of the necessary packages in order to work with Docling and LlamaIndex and to parse these documents; then we're going to import some necessary modules.
And this code is actually from Docling's documentation itself; there's code like this for LangChain as well. But the main functionality is right here, where we're defining some of the main parameters.
So we're importing a Hugging Face token, and we're working with two different models: an embedding model in order to convert parts of the document into nodes and use them for response generation from our vector database, and also a regular gen AI model in order to provide a natural language response.
So let's go ahead and move on to the RAG pipeline. What we're doing is importing some of these core LlamaIndex and Docling components, like the DoclingReader, among other readers that you can get within LlamaIndex. And what we're going to be doing is processing this document, extracting some of the structured text, and turning it into nodes.
So we've got a reader to convert this input document, just like what we did in the past example, and a node parser to convert it into nodes, or different sections and paragraphs, to be logically extracted when we ask a question, as we'll do later on, which is: what are the main AI models that Docling uses? So let's go ahead and take a look at the vector store.
Right. This is important for converting our information into vectors so we can perform a similarity search. What we're doing is essentially creating an index.
So we'll be able to sort and process through the document from the source here, which, as we have here, is, let's see what it is, drumroll, the Docling technical report again.
And then we're going to convert this into nodes. And transform this, so that then we can embed them, and save them into this vector database here. And so the result is going to be.
Oh, and I should have cleared this right here. But I'll run all of this again and we'll be able to see what the output is going to be. So I'll clear the output there.
And we're going to ask the question again. But what should happen is we're going to ask the question, hey, what are the main AI models? And based on the similarity of these nodes to the original question, using the embedding model that we specified, which is this small model here, we pull that information back and generate a response with the open source Mistral model.
And just like that, we have the answer, which is: it's powered by two different models, DocLayNet and TableFormer. And the sources are provided here, which is the ecosystem part of the paper that we saw earlier. So this is just a really cool example of how you can use Docling in a RAG pipeline, for example with LlamaIndex or with LangChain, to extract this information, and you could actually start to fine-tune models as well.
But let's head back over to the repository and we can close out. So I definitely see a lot of potential in this project. Personally, I've been using it to convert PDFs into markdown to use with another open source project called InstructLab, which does fine-tuning of models.
And then at that point, with the fine-tuned model, adding RAG on top to pull in those dynamic data sources. So I think not only for RAG, but also for fine-tuning and just generally being able to work with your data in an AI workflow, Docling is really impressive with its performance and benchmarks. I highly recommend that you try it out: search for it on Google and GitHub, see the paper, get it working on your device, and let me know what you think about Docling below in the comments, along with any other future videos you'd like to see.
As always, my name is Cedric. Thanks so much for watching and I'll see you in the next one. Bye.