Comparing 10 different models, including Gemini Flash 2.0, Grok, Claude, GPT, and Llama, for OCR

LLMs for Devs
Code: https://github.com/trancethehuman/ai-workshop-code/tree/main/projects/ocr-battle My deep dive...
Video Transcript:
Yesterday I did an in-person workshop about using large language models in vision mode to do OCR on documents and screenshots. I compared 10 different models, from the Gemini models to Grok to Anthropic's Claude and the GPT models (GPT-4o and GPT-4o mini), to see which ones did well and which didn't. Some of the examples I used were PDFs like "Attention Is All You Need" and some RAG papers, those same PDFs but upside down to see how the models handle that, screenshots of my screen including code (and sometimes code and a PDF at the same time), and of course some scanned PDFs to see whether that matters for performance. Finally, I used LangSmith, an eval and tracing platform from LangChain, to score the results from these models and compare them to one another using an LLM as a judge. So stay tuned. I mean, this is the video, so I guess let's get started.

We're going to use the same prompt for all the models to do OCR. This is kind of not kosher, because every model is prompted a little differently; when you jump from GPT models to Llama models, your prompt has to change a bit or you won't get the best performance out of each one. This is where you can use something like DSPy, or few-shot it, to squeeze the last 5 or 10 percent out of your performance. But this is a quick test, not very scientific. I just want to show you how easy it is to set up the clients for these models and then look at some quick examples.

This is the prompt we'll be using. Let me see, can everyone see my screen? Cool. If you can read it: you are an expert at OCR, extract the content from images; if there are charts, draw them in a markdown-friendly format; if there are tables, format them in markdown; do not wrap everything in a markdown tag. And then: now go and do the thing. That's basically the system message for the OCR pipeline. Any questions? Makes sense? Okay, cool. All right, next slide. See, I love these notes for myself; I'm glad I wrote these.

Let's go through the providers one by one; every provider has some wrapper code. Let's start with Anthropic's Claude. For Claude you do pip install anthropic; Anthropic has an SDK for Python (I'm not sure if they have one for JavaScript or TypeScript, but Python for sure). You import it, set up your API key, create a client, and then you can start calling the model, so it's literally just this amount of code. I also have some tracing here because I'm using a tool called LangSmith, and I apologize if you missed the last two sessions, because this is the third workshop in the series about evaluations: we use LangSmith to trace our calls to LLMs. It's basically observability 101, the same as in traditional software engineering, but some people don't have any observability over their systems, and you should, because otherwise you won't know which calls are failing or which calls aren't doing well. If this is your first workshop, just remember that this is so we can get the call onto a platform where we can look over the logs and figure out whether our system is behaving properly.

For Claude, we pass in the system prompt, and then in our messages we send a message from the user whose content is of type image, and we pass in the image data, which we have locally. I'll show you the images I'm showing to these models eventually. The rest is just the code I use to test this one provider.
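To make that concrete, here is a minimal sketch of what a Claude OCR call can look like. The model id, file path, and max_tokens below are placeholders I chose, not necessarily the values used in the workshop:

```python
import base64
import anthropic  # pip install anthropic

# The OCR system prompt described above
SYSTEM_PROMPT = (
    "You are an expert at OCR. Extract the content from images. "
    "If there are charts, draw them in a markdown-friendly format. "
    "If there are tables, format them in markdown. "
    "Do not wrap everything in a markdown tag."
)

client = anthropic.Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")

# Read a local image and base64-encode it; the Messages API takes image bytes inline
with open("code_screenshot.png", "rb") as f:  # placeholder path
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model id; any Claude vision model works
    max_tokens=4096,
    temperature=0,
    system=SYSTEM_PROMPT,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "Extract all of the content from this image."},
        ],
    }],
)
print(response.content[0].text)
```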
Google is interesting, because traditionally Google Vertex AI, or the Google SDK, is, I wouldn't say very hard to set up, but not very developer friendly. Thanks to the PM teams at Google, though, there's now a way to set up a Gemini client very easily, and that's through OpenAI. You install the openai package, you pass in a base URL that points to Google, you pass in your Gemini API key, and now you're calling Gemini models without your code having to change at all: your function signatures, everything you were using with OpenAI before stays literally the same; you just point it at Google through the OpenAI SDK. We trace it, and then the rest is the same code as OpenAI, because we're using the OpenAI SDK: the same system message (you're an expert doing OCR), and then we pass the image URL in. That's about it, pretty straightforward, pretty simple. We always set the temperature to zero to reduce the amount of randomness when the models generate. Any questions about anything? Is everyone feeling okay? If anyone's confused, just let me know and I'll talk you through it. Again, this is just the code to test this individual provider, nothing super special here.
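Roughly, the Gemini-through-OpenAI setup looks like the sketch below. The base URL is Google's OpenAI compatibility endpoint as I understand it, and the model id is my assumption:

```python
import base64
from openai import OpenAI  # pip install openai

SYSTEM_PROMPT = "You are an expert at OCR. Extract the content from images."  # same prompt as before, shortened here

# Same OpenAI SDK, pointed at Google's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="YOUR_GEMINI_API_KEY",
)

with open("attention_paper_upside_down.png", "rb") as f:  # placeholder path
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gemini-2.0-flash-exp",  # assumed id for the experimental Flash 2.0 model
    temperature=0,  # keep runs as deterministic as possible
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Extract all of the content from this image."},
        ]},
    ],
)
print(response.choices[0].message.content)
```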
Next, the Llama models. Meta AI research just dumps these models out to the world; they don't provide you with inference infrastructure. So providers like Together AI, or Amazon Bedrock I'm pretty sure, or even Azure, are the ones hosting Llama models out of the box for you to use; you pay those providers and get access to the Llama models. Why use Llama models? Well, they're open weights: you can fine-tune them if you want, and if you really care about private AI you can fine-tune them and host them on Bedrock or somewhere on Amazon. I don't host my own models, so I'm making some of this up, but essentially, and how could I forget this, you can also run them on your computer locally using Ollama, fine-tune them, basically air-gap everything and own your own data and your own model, all that good stuff. But if you just want to use them out of the box, you can go with Together AI: pip install together. Again, these providers create SDKs that are very similar to one another, because they don't want you to have to figure out how to change your code to start using their stuff. So Together AI is a very similar setup, with some test functions down at the bottom, nothing too special.

OpenAI is the most straightforward, and probably the one most people in the room will be familiar with. You set up your OpenAI client and pass in a user message with an image_url key; you can pass in the binary as well, OpenAI can deal with that too. Other than that, just a testing function, nothing special there.

Last but not least, and I was personally very curious about this one, because I keep hearing about xAI: you can use it on Twitter, or should we call it X now, where you talk to Grok, and Grok uses this model as far as I know. Again, it's the OpenAI SDK. If a company is a challenger that came out recently, or if they want you to not have to change your workflow, it's most likely they'll let you call them through the OpenAI SDK, because they know most people already use OpenAI. So you just install the OpenAI SDK, pass in your xAI API key, and make sure your base URL points to the xAI API endpoint. You send a message literally the same way as with OpenAI; there's no difference. There are some settings here, and a test function down at the bottom, because I need to test them individually, one by one.

That's it, those are all the providers. And for a provider like Google Gemini, it's actually four models under the same provider, since they're all Gemini models.
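To show how little changes between providers, here's a sketch of the Together AI and xAI clients side by side. The model ids and the xAI base URL are my assumptions, and the image is encoded the same way as in the earlier sketches:

```python
import base64
from openai import OpenAI
from together import Together  # pip install together

SYSTEM_PROMPT = "You are an expert at OCR. Extract the content from images."  # same prompt as before, shortened here

with open("bill_gates_resume.png", "rb") as f:  # placeholder path
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

user_message = {
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        {"type": "text", "text": "Extract all of the content from this image."},
    ],
}

# Llama vision models hosted by Together AI: the SDK mirrors OpenAI's interface
together_client = Together(api_key="YOUR_TOGETHER_API_KEY")
llama_response = together_client.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",  # assumed model id
    temperature=0,
    messages=[{"role": "system", "content": SYSTEM_PROMPT}, user_message],
)

# Grok through xAI's OpenAI-compatible endpoint: only the base_url, key, and model change
xai_client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_XAI_API_KEY")
grok_response = xai_client.chat.completions.create(
    model="grok-vision-beta",  # assumed model id
    temperature=0,
    messages=[{"role": "system", "content": SYSTEM_PROMPT}, user_message],
)

print(llama_response.choices[0].message.content)
print(grok_response.choices[0].message.content)
```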
Cool. Let's go back to the slides. One last step before we run the experiment: this is my utility function. All it does is pull in all the providers and construct a way for us to visualize everything on LangSmith, the tracing and observability platform. It's cool because what's going to happen is I run these models against my test cases and then have another LLM judge the responses from these models. Of course it's not just going to say "is this good or bad," so here's how it works. Actually, let me click through my slides real quick. There we go, I remember this part, I didn't miss anything. This is how we're going to judge it.

Essentially, these are the images we'll be testing; I'll go through them one by one, maybe bottom to top. This one is just your run-of-the-mill PDF, from a research paper about RAG. You can see there's a diagram, which makes it a little challenging for most OCR pipelines, and it has two columns. If you've ever done PDF extraction with columns, a naive pipeline reads straight across the page and the text comes out as gibberish. What people have realized over the years is that the best way to extract text from a PDF is to use OCR, not a hardcoded text-extraction pipeline like PyPDF in Python.

We also have a test case for screenshots; this is literally just a screenshot of my screen, and I want to see if the models can actually extract the code in the middle. I wanted to take it a step further too: what if it's code and a PDF at the same time, code on one side and a PDF on the other? That's also just a screenshot. What use case would this be? Is everyone here familiar with agentic workflows, agents that control your computer? Anthropic came out with a computer use API recently, and this kind of thing has been done for many months now. Essentially, there are vision models that are fine-tuned to give you bounding boxes of elements on your screen, which can then be passed to another model to decide what to click on. For example, you can say, "Write an email to my friend Michael about a party this Saturday." First it takes a screenshot of your screen to know where it currently is, and if it can find, say, the email icon, the fine-tuned vision model draws a bounding box there and gives the coordinates to the next model so it can click on it, as long as you've given it permission to click around on your screen. It just keeps running like that until it's able to open your email, look at the email screenshot, figure out which panel and which button will compose an email, and it keeps going until it can send the email to your friend Michael. Or not, because apparently Anthropic found out that if you just let it run, it will go on Twitter and start tweeting about random things, or do some other stuff, so I wouldn't say it's fully production ready. That was one of the examples.

Then this one is, again, just a screenshot of the paper "Attention Is All You Need," the one that kickstarted this whole GPT thing, back when Google was sleeping on Transformers. I wanted to take that a step further and OCR it upside down. People know by now that vision models do better when the content is in the right orientation, but how would you know whether your PDF or image is upside down? You'd probably need a separate step to check for that, which adds latency and cost, right? So what if you just leave it upside down and run it anyway? All right, that's how we test.

Actually, one more thing I want to add about how we test this. We're going to use LangSmith to visualize our tests, and what I mentioned before is called LLM as a judge: using an LLM to judge the output of another LLM. Traditionally, in ML evals you have a reference value, maybe a true/false Boolean, ones and zeros, and you can check against that value to benchmark your ML models. But what if you want to check something like "is my chatbot responding to the user in the tone I want, caring or professional?" You can't really check that against a set number, so traditionally people would annotate these by hand. Just to give you a sense of what's happening here: every time a model looks at an image and tries to extract the information from it, I let GPT-4 compare the extracted information to the reference information, which I have personally verified to be accurate. I want to see whether there's a lot of deviation in formatting, or whether it made up words, and you'll see that it does make stuff up, especially the smaller models.

This is the judge, and you basically create a judge by prompting it: "you are an expert at grading a student." The more role playing you give it, the better, in my experience; if you're a professor or a teacher somewhere, this will come very naturally to you. It grades on four criteria: is the response similar to the reference, is the formatting right, is the order of the information correct, and is there information in the response that is not in the reference material, meaning the model is making things up. It scores on a scale of 1 to 5. This is what we call LLM as a judge, and it's a pretty hot thing nowadays for LLM evals. Again, if you're not doing evals, your stuff is not in production; you have to do evals.
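Here's a rough sketch of what that grader can look like as a standalone function. The exact wording and the LangSmith wiring are simplified; in the workshop this runs as a LangSmith online evaluator, but the same idea works as plain code:

```python
from openai import OpenAI

judge_client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

JUDGE_PROMPT = """You are an expert at grading a student's OCR transcription against a reference.
Score the student response on a scale of 1 to 5, considering these criteria:
1. Is the response similar to the reference text?
2. Is the formatting (markdown, tables, headings) consistent with the reference?
3. Is the information in the correct order?
4. Does the response contain information that is NOT in the reference (hallucination)?
Reply with the score followed by a one-sentence justification."""

def judge_ocr(reference_text: str, model_output: str) -> str:
    """Ask GPT-4o to grade one OCR result against the hand-verified reference text."""
    response = judge_client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user",
             "content": f"Reference:\n{reference_text}\n\nStudent response:\n{model_output}"},
        ],
    )
    return response.choices[0].message.content
```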
Otherwise, as someone here points out, like I said in the last video: if you're going to use AI, how do you trust that the response is something you could then send to a user? Exactly. You can see it over time, and that's why you always see these chat interfaces with thumbs up and thumbs down; that information gets fed back into the annotation queue, and over time you can do more than just eyeball it, which is mostly what I do: you can fine-tune on user preferences, because you have records of which responses users liked and which they didn't. You keep all the good ones and fine-tune your model to be more like that. If you don't want to fully fine-tune, you can do few-shot, which just means stuffing the correct examples into the prompt, into the messages. Some people call it in-context learning, but it's a form of simple fine-tuning.

Okay, that's our evaluator, and it will run every time a model makes a call to do an extraction. Someone asks: and that's because you're using LangSmith, and LangSmith is hooked into your code? Yeah, pretty much. LangSmith makes its calls in the background after we make the call to the model, so it's not blocking our application. And that's all just from those decorators before your methods? Yeah, let me show you one more time: all you have to do is pip install langsmith, then from langsmith import traceable, and you start tracing these functions right away. It's just a decorator, and that's it. And it then triggers that? Exactly.
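As a sketch, the tracing setup might look something like this; the environment variable names and project name are my assumptions about a typical LangSmith setup, and the function bodies are the provider calls from the earlier sketches:

```python
import os
from langsmith import traceable  # pip install langsmith

# Assumed environment setup so LangSmith knows where to send traces
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "YOUR_LANGSMITH_API_KEY"
os.environ["LANGCHAIN_PROJECT"] = "ocr-battle"  # hypothetical project name

@traceable(name="claude_ocr")
def claude_ocr(image_path: str) -> str:
    # The client.messages.create(...) call from the Claude sketch goes here;
    # its inputs and outputs get logged to LangSmith automatically.
    ...

@traceable(name="gemini_ocr")
def gemini_ocr(image_path: str) -> str:
    # Same idea for the Gemini-through-OpenAI call; every invocation shows up as a trace.
    ...
```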
I'm just going to run this now; it takes a while and costs me money, but it's fine, I've just kickstarted our entire test. What's going to happen is we can go look at LangSmith and see that these models are running, and we can give it a little while so we can eventually compare. Quick overview of LangSmith: real-time tracing, this model is running right now, Llama 3.2 90B, loading, and if you go back to our terminal you can see it's currently doing the code screenshot PNG, and we've run a couple of different providers so far. It's going to take a while, so I'm just going to show you historical traces. I kind of wish half of this front end weren't just headers so you could see more, but I should stop complaining, because I'll post this video afterwards and then I'll have to edit that out; I have to be careful about my words.

Because we have an evaluator that's an LLM, it can give a score. I don't know if you can see the scores here: this one got a 4.5, a five, a four, a four. Those are the scores the LLM has given the models in the past. What's happening here is that we've set up what's called an online evaluator. There's a difference between offline eval and online eval. Online eval, which is what most people do honestly, especially software developers, means traces come in and we grade them right there. Offline eval means you create a dataset afterwards, which is a traditional ML thing: you create a dataset from these traces, take it to the side, build a pandas DataFrame of some kind, and run your own evaluator over the offline dataset. Online just means we grade them as soon as they come in, and in a few seconds you'll see some scores pop up here.

From what I'm seeing, Google Gemini Flash 2.0 is hands down the best, and I'll show you some examples. What about the cost? I thought about doing the cost comparison, but costs change so much; even yesterday OpenAI dropped GPT-4o's cost, so one day one model is cheap and the next day another one is cheaper. At least for me, I was able to decrease my cost by about 200 times just by switching from GPT-4 to Gemini 1.5 Flash, but that's no longer the case, because it's more like 60 times now. The cost conversation is hard to keep up with because everything changes so fast.

But I want to show you the difference between some of these models. For example, this one got a 1; it did not do well at all, and these are both Llama models, so it doesn't matter whether it's the big one or the small one. Because we set up an LLM as a judge, we can also get the reason it graded that way: you can click on it and see "Student response contains elements from the reference but is largely inaccurate," along with notes about hallucinations and things like that. The LLM, comparing the reference to the model's output, could tell that it did badly, and that's why it gave it a 1. So just be careful: sometimes models don't do well at all for certain use cases, and this one is a hard one, code on the left and a PDF on the right, so it's one of the more difficult tests.

Someone asks where the reference comes from. Great question: the reference is text I copied out by hand, checking every single word, so we can treat it as a source of truth. So you manually did it? Yeah, it's me going to the PDF, Command-C, Command-V, and making sure every single word is accurate, and then we keep that as the source of truth we can reference back to. The reason we trace everything and keep these sources of truth is that if you switch one day from GPT-4o to Gemini, how do you know it won't break your system? If you have a suite of tests with a bunch of ground truths like this, you can run the same pipeline against them again, run the LLM eval again, and if the score goes down you know something is breaking, even though the news says the new model is better; for your system it might be breaking certain things. That's usually what happened when people moved from GPT-4 to GPT-4o: it was better in certain ways, especially latency, but worse in others, especially reasoning.

Okay, let me pick another example, Gemini Flash. Actually, let's just filter by use case; I already set up tags, so I can filter by tag. Let's say we only want to compare on the Bill Gates resume, that's a good one. Just to remind you, this is what Bill Gates' resume looked like back in the day; he obviously used a typewriter, I believe, so it's not as bad as a handwritten PDF, but it's scanned, it doesn't look perfect, it has some marks on it, and it's kind of off-white. William H. Gates, that's what he called himself. Cool. So we go back to LangSmith and take a look at how the models did with the Bill Gates resume. Grok seems to have done okay and got a 4; let's see why. "Student response is similar but not verbatim," so not one-to-one, the formatting is slightly different, and there's a minor error in the phone number, a little bit of inaccuracy there, and that's why it got a 4 and not a 5.
GPT-4o got a 4.5, which is pretty close to perfect; it seems like everyone got a small error in the phone number. This one got a 5: the student response is almost identical and the formatting is consistent, except for one little error there. So again, Gemini Flash 2.0 is the GOAT.
The only thing about Gemini Flash 2.0 is that it's still experimental, so it's rate limited and you can't even pay for it yet; that's why I haven't put it into production and I'm still using Gemini Flash 1.5.
Gemini Flash 2.0 is actually better than GPT-4o, and we don't know how Gemini Pro 2.0 is going to be; probably as good as o1 or o1-mini. Okay, one last example. What if we look at the test case for this PDF and see who did not do well? We've got some threes here: Llama 3.2 90B Vision apparently did not do too well, and the smaller one, the 11B, did better. That's interesting, so let's take a look.

Here's one cool thing about LangSmith: you can open up a trace, hit compare, and compare it against another trace, which is this one. Can everyone see my screen? The UI is a little weird, but essentially we have the 90B on this side and the 11B on that side, and we can look at how the runs were actually traced. As you can see, this is our PDF passed in as the input, and then we go down to the outputs; the scrolling is supposed to sync, but it isn't, okay, there we go. Right off the bat we can see that the 90B version has some text the 11B does not, and I don't know where that text is coming from, so let's check the PDF real quick. I think it's this one. The words we just saw were these little red areas in the diagram, and the model just put them in the output without telling us they came from the diagram. In the reference, the source of truth, that diagram is actually a Mermaid chart of all those pieces. So neither of these models did it perfectly, and from the grader's point of view it's "why the heck did you format it like this," so points got deducted. Again, this test is not perfect; obviously we could have prompted the models to put diagrams in a list, or told our judge that as long as the text is somewhere in the response it's fine and doesn't have to be in a Mermaid chart. (A Mermaid chart is just a way to draw diagrams using code.) This is where you have to play around with the judge's prompt and make sure it's right.

Another thing I found about these small models: the big model knows when text is cut off, but the small model will hallucinate and put words at the end if the text doesn't read in a logical way. For example, this line here ends with the word "the," and the small model wants to put something right after it because it doesn't feel right. That's what you sometimes get with a small model: it'll make stuff up.

Someone asks: do you have a previous run of the same test, and would we see the same ratio where the 11B did better? Is it consistent, or would we see a different score? That's a great question, and I want to say two things. One, to make the test better, you obviously want to aggregate the score over a series of runs. The second thing is something I learned from a guy named Sam, who I had on a podcast recently: if you want better performance out of your OCR, just run it three times and then make the model reconcile the differences, because it most likely won't hallucinate the same thing more than once. And if you want to be extra sure, ask the model to OCR the same thing four times and then say, "here are four versions of the same thing, merge them into one"; if an error happens once or twice, the model will be able to correct it. But obviously that 4x's your cost and 4x's your latency, unless you run them in parallel.
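A minimal sketch of that run-it-multiple-times-and-reconcile idea (my own paraphrase, not code from the workshop): ocr_once stands in for any single-model OCR call from the earlier sketches, and the reconciling model here is assumed to be GPT-4o via the OpenAI client.

```python
from openai import OpenAI

reconcile_client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

def ocr_with_reconciliation(image_path: str, n_runs: int = 3) -> str:
    """Run the same OCR call several times, then ask a model to merge the versions.

    The assumption is that a hallucination rarely repeats across runs, so the
    reconciliation pass can discard one-off errors. Note that this multiplies
    cost and latency by roughly n_runs unless the runs happen in parallel.
    """
    versions = [ocr_once(image_path) for _ in range(n_runs)]  # ocr_once = any single OCR call from above
    numbered = "\n\n".join(f"VERSION {i + 1}:\n{v}" for i, v in enumerate(versions))
    reconcile_prompt = (
        f"Here are {n_runs} OCR transcriptions of the same document. "
        "They should be identical; where they differ, keep the text that appears "
        "in the majority of versions. Return a single reconciled transcription."
    )
    response = reconcile_client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": reconcile_prompt},
            {"role": "user", "content": numbered},
        ],
    )
    return response.choices[0].message.content
```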
You know, what I've learned recently is that once you start running these tests, you kind of have to run your own tests: you can listen to the advertising, but you should probably also build your own eval set. Yeah, I know, I'm just saying that in general you should run your own evals, because I was looking at different search providers for LLMs, and it turned out the one with the fanciest landing page did not do well at all. I learned my lesson there, and that test was replicated by some friends of mine as well, same results, and we had some fun conversations about it. But yeah, that's pretty much it. I'll share the code with you after today; it's really easy, just go and try it. How many providers did we try today? Let's see: one, two, three, four, five, six, seven, eight, nine, ten. Ten options for your OCR needs. I ran the test and I'll send it to you afterwards; make your own choices, make some informed buying decisions, be a smart consumer. Someone asks: how much money did this cost you? Oh, it's nothing; so far it's been really great, the Gemini Flash 2.0.