Wow, Flash was 6.7 seconds, GPT was 11 seconds, and then Anthropic's Claude was almost 21 seconds, so that's kind of a big jump.

We all know that RAG is so important when we're building out our AI agents, but the question is: which large language model is actually the best for RAG? Today we're going to answer that question by putting three popular models to the test: OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini Flash 2.0.
So make sure you stick around to the end so you can see who takes the throne.

Real quick, before we get into the experiment, I'm going to keep this really high level, but we have to understand what RAG is. RAG stands for retrieval-augmented generation, and it's basically the process of your agent going out to get information that it doesn't already have in its training data or its prompt. The R is retrieval: going out to retrieve the information. The A is augmented: it takes the information it gets back from the vector database (or whatever database you have it access) and augments it with the rest of what came through in the query and the prompt. And finally, generation: that augmented prompt is handed off to the large language model to generate a response that makes sense to the human, based on the context of its role and the query that originally came in.

We'll hop into n8n and you'll see the three different agents we have, and we'll actually run through this process so it makes a lot more sense, but real quick, here's what it looks like. The process starts with a user submitting a query to the large language model. That model then creates a prompt to feed into a vector database. Within the vector database, that prompt gets vectorized and the nearest vectors around it are pulled back; that's where we're retrieving content. From there, that content is fed back to the large language model to generate a response for the human to read. At that point it's just a matter of any follow-up questions: if yes, the loop happens again; otherwise the process ends.

So let's hop into n8n, take a look at the three agents we have, walk through the data that's in the vector database, and start testing these agents to see which one's the best. The idea of this experiment is to keep everything as consistent as possible. We're going to limit the variability as much as we can by doing the following. We're going to test seven different parameters: information recall, query understanding, response coherence and completeness, speed, context window management, conflicting information, and source attribution. For each test, we're going to feed the exact same prompt to each of the agents and then evaluate their responses. Within each agent we have the exact same system prompt, which is basically: you're an agent responsible for retrieving and summarizing NVIDIA's financial and earnings information (that's what's in the vector store). We tell it how to call the tool and what the tool does, and that's really the only thing this agent is doing: calling its vector database. All of these agents have the exact same prompt, so we're feeding in the same information, running it through the same prompt, and running that prompt against the exact same vector database with the exact same data.

As far as actually grading the responses: I had done this once before and it was hard for me to score each response out of 10, because sometimes there was a baseline and I had to compare the different models' responses against each other. So to keep this consistent, I'm going to run the responses through ChatGPT with access to the original PDF that's in all of these vector databases. I'm not going to say which response is from which model; I'm just going to tell it to give all three responses a rating out of 10. We'll do that for every round of responses, so it acts as a consistent grader, and we'll see who has the highest average score at the end.
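To make that grading step concrete, here's a minimal sketch of a blind grading harness in Python. This isn't exactly what I'm doing in ChatGPT (there I just paste the responses in by hand), and the function and names here are made up for illustration; the point is shuffling the responses so the grader can't tell which model wrote which.

```python
import random

def build_grading_prompt(question: str, responses: dict[str, str]) -> tuple[str, list[str]]:
    """Shuffle the model responses so the grader can't tell which model wrote which,
    then build one grading prompt asking for a score out of 10 per response."""
    order = list(responses)              # e.g. ["claude", "gpt-4o", "gemini"]
    random.shuffle(order)                # hide which model produced which answer
    numbered = "\n\n".join(
        f"Response {i + 1}:\n{responses[name]}" for i, name in enumerate(order)
    )
    prompt = (
        "You have the original NVIDIA Q1 FY2025 earnings PDF attached. "
        "For the question below, rate each response out of 10 for accuracy and completeness.\n\n"
        f"Question: {question}\n\n{numbered}"
    )
    return prompt, order                 # keep `order` to map scores back to models
```

You would then hand the prompt to whichever grader model you trust and use the returned `order` to map the scores back to the models afterwards.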
Before we actually start that experiment, we're going to do one example to see how this process works, sort of going back to that diagram, but let's actually see it within n8n. I hooked up the chat trigger to the Claude agent, and in the chat trigger I'm just going to say: what was NVIDIA's Q1 fiscal 2025 data center revenue? We'll see the process happening, but what we're interested in here is the log, so once this finishes up we'll click into the agent, look at the log, and see what happened.

The agent looks over its system prompt as well as the query it just got from the human in order to decide what it needs to do next. In this case it says: to answer this question accurately, I'll need to use the NVIDIA tool to retrieve the most up-to-date information about NVIDIA's earnings report; let me do that for you. At that point it takes our original question and turns it into a query to send off to the vector database. So it sends off this query: what was NVIDIA's Q1 fiscal data center revenue? In this case it didn't really change the human's query, but sometimes it does. Then it gets a response back from the vector database, and as you can see, we've got some information here. This process looks like hitting the actual vector store, sending off that query, using the embeddings model to look through the embeddings, and getting an answer back. With that answer from the vector database, it uses its model again, Claude 3.5 Sonnet in this case, to structure it into an output that's human-readable and makes sense to us.

Okay, I hope that quick example paints a better picture of the way the agent and the vector database talk to each other and how the query is sent off.
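If it helps, here's roughly that same loop written out as a small Python sketch. The `embed`, `search`, and `generate` callables are stand-ins for the embeddings model, the Pinecone vector store, and whichever chat model the agent is using; this is a simplification of what n8n wires together for you, not its actual internals.

```python
def answer_with_rag(question: str, embed, search, generate, top_k: int = 4) -> str:
    """Minimal retrieve-augment-generate loop."""
    # Retrieval: turn the question into a vector and pull back the nearest chunks.
    query_vector = embed(question)
    chunks = search(query_vector, top_k=top_k)

    # Augmentation: combine the retrieved chunks with the original question.
    context = "\n\n".join(chunks)
    prompt = (
        "You are an agent responsible for retrieving and summarizing NVIDIA's "
        "financial and earnings information. Use only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Generation: the chat model writes a human-readable answer from the augmented prompt.
    return generate(prompt)
```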
But let's get started with the experiment. The first section is information recall. I've got a prompt ready: how much did NVIDIA's GAAP operating income grow year-over-year in Q1 fiscal 2025? We're going to send this off, and this is the same prompt we'll use when we test GPT-4o and Gemini Flash 2.0.

Here's what we got from Claude: NVIDIA's GAAP operating income grew by 690% year-over-year in Q1 fiscal 2025. It breaks it down a little further and gives us a good recap at the end; it's structured very nicely and it's very human-readable. Right here we've got the GAAP operating income, up 690% year-over-year as you can see, and it gave us the underlying figures of roughly $16.9 billion and $2.1 billion, which we can also see right here. So it is correct, and we're going to enter this as one of the responses for the GPT to grade.

So we saw that the response coming back from Claude was correct. I copied it over to ChatGPT, and let me show you what this looks like so you understand how these are being graded. Once again, I'm not a scientist or an experiment expert, but I thought this would be a good way to keep things consistent and leave it less up to my own interpretation. Anyway, I gave it the PDF, told it to read it, and now I'm going to say: for each test, based on the PDF, give each response a rating out of 10. Here's response one, and then we'll grab responses two and three from the other agents and see what grades the GPT gives them.

Let's send this same query off. We reposted the message to the GPT-4o agent, and we'll give it a quick scan. I'm not going to read every single response from every single model for all seven tests, because that would make this video way too long, but as you can see we got: GAAP operating income grew by 690% year-over-year in Q1 fiscal 2025. So that is correct; however, you can already see that Anthropic's response was better, just because it was a little more detailed. Finally, let's hook up our Gemini model, hit repost message, and see what Google comes up with. That one actually finished really quickly, and it gave us completely identical results, so it will be interesting to see how ChatGPT grades these.

Okay, this is interesting: even though responses two and three were identical, they got different grades. Anthropic got a nine, OpenAI got a seven, and Gemini got a six, so we're going to enter 9, 7, 6 to keep things consistent. Like I said, this isn't a perfect experiment, but we're going to run everything through the same grader so it's at least consistent.

Let's move on to the next one, which is query understanding. It's also going to be important to switch up the order of the responses we put into ChatGPT, so this time we're going to start with Google Gemini. The prompt was: tell me about growth in NVIDIA's segments, and how is NVIDIA's dividend policy changing? So it's kind of vague and it's also multi-part, but it looks like we got a pretty good answer back, so I'll copy that over and move on to OpenAI. Firing it off to OpenAI now. Okay, we got the response back; one thing I've already noticed is that Gemini is much quicker, which will be interesting when we get to the speed test. Here's our answer from OpenAI; it's structured a little differently, but let me copy this one over to our grader. Okay, we're hooked up to Claude, so let's send it over. I think this is already super interesting, just the way we're seeing differences in the responses and how far in depth each one goes.
Also, if we were to come in here and look at the executions and the different queries that are actually being sent from the agent to the Pinecone vector store, I'm sure we'd see some differences. Okay, we just got Claude's answer. I'm going to check the PDF to make sure everything's accurate, and then we'll send everything off to the grader and see what scores come back.

While the grader is working, I just wanted to look at this real quick. In this log we'll see that the agent hit the NVIDIA tool a couple of times: it sent off two queries. The first one was "What is the growth in NVIDIA's different segments?" and the second was "How is NVIDIA's dividend policy changing?" So you can see the agent took the query we sent and broke it into two different queries. Now let's look at the executions and go into the second most recent one, which was OpenAI, and see how OpenAI handled it. Click into the agent, go to logs, and it did the same thing: the first query was "What is the current growth in NVIDIA's business segments?" and the second was "How is NVIDIA's dividend policy changing recently?" Okay, so that's cool.
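That query decomposition step is worth calling out, because it's a big part of why the multi-part question worked. As a rough sketch (with `generate` and `retrieve` as stand-ins for the chat model and the vector-store tool, not n8n's actual implementation), it looks something like this:

```python
def decompose_and_retrieve(question: str, generate, retrieve) -> list[tuple[str, str]]:
    """Split a compound question into standalone sub-queries, then hit the
    vector store once per sub-query, the way Claude and GPT-4o did here."""
    split_prompt = (
        "Break the following question into standalone search queries, "
        f"one per line:\n\n{question}"
    )
    sub_queries = [q.strip() for q in generate(split_prompt).splitlines() if q.strip()]
    # e.g. ["What is the growth in NVIDIA's different segments?",
    #       "How is NVIDIA's dividend policy changing?"]
    return [(q, retrieve(q)) for q in sub_queries]
```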
Now let's quickly look at how Gemini did it. As you can see, Gemini took 4.9 seconds, OpenAI took 10.9, and Claude took 18.2. Something else interesting to note: Gemini only went to the vector store once. It sent "Tell me about growth in NVIDIA's segments and how is NVIDIA's dividend policy changing" as a single query, so do with that what you will, but I think it's worth looking at.

Anyway, for the grades: one was Gemini, two was OpenAI, and three was Anthropic. The responses were pretty similar, but for round two we got 7, 9, and 8. It's a pretty tight race; OpenAI and Anthropic are tied right now, which is interesting, but Gemini is not too far behind, and we have speed coming up, where as you can see it's been pretty killer.

Let's move into response coherence and completeness. This time we're starting with OpenAI. I'm going to paste in the query and send it off. From now on I'm not going to show you the process of doing this with every single agent, because that would make the video too long, but I'll show you all the responses and how they're graded. For this one we said: summarize the key financial highlights of NVIDIA's Q1 fiscal 2025 results. Okay, we just got the grades back; if you want to pause and read what it's saying, you can, but the first response was OpenAI, the second was Anthropic, and this one was Gemini, so I'm going to update our scoreboard.

Now we're moving on to speed, which is exciting because I think Gemini may pick up some ground here, so let's hop back into n8n and send off these prompts. We're starting with Gemini, and I said: summarize NVIDIA's Q1 fiscal 2025 growth drivers and how they plan to address the generative AI market. That was really quick. For this one we're just going to look at the execution time and assess based on that. Now let's send the same question off to OpenAI. I know it's maybe not a perfect experiment, because we're not actually evaluating the content in this test parameter, but I just wanted to send off a multi-step query and see how long each model takes. We've got that one now, and finally let's go to Anthropic and send this one off. One time I was doing this experiment just for fun between Claude and GPT and I was using the timer on my phone, but I realized it's so much smarter to just look at the execution, which records the actual milliseconds. Okay, that one just finished up, so we'll click into executions and take a look.
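If you ever do want to time runs yourself instead of reading n8n's execution durations, a tiny wrapper like this is enough; `run_agent` here is a made-up stand-in for whatever actually fires the workflow (an HTTP call to a webhook, an SDK call, and so on):

```python
import time

def timed(run_agent, prompt: str):
    """Measure wall-clock latency for one agent run."""
    start = time.perf_counter()
    answer = run_agent(prompt)
    elapsed = time.perf_counter() - start
    print(f"{elapsed:.1f}s")  # roughly what the execution view reports per run
    return answer, elapsed
```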
Wow: Flash was 6.7 seconds, GPT was 11 seconds, and Anthropic's Claude was almost 21 seconds, so that's kind of a big jump. Here's my reasoning: I decided to give Flash a 10, OpenAI an 8, and Anthropic a 6, because OpenAI took basically double the time Google did, and Anthropic took basically double the time OpenAI did. So that's what we're going to go with; let's keep going down this experiment.

For test number five we've got context window management. I didn't know a great way to test this, so I'm just going to say: summarize as much of the earnings report as possible. We'll see how much comes back and how accurate it is, and we'll grade based on that. Here's what we got from Anthropic; let me paste that into the grader, and then we'll move on to OpenAI. Now we're sending it off to OpenAI, and last but not least let's hook up our Gemini agent and send it there as well.

Okay, so this is interesting. I sent the exact same prompt to all three agents, but when I sent it to Gemini it said: okay, I understand, I'll use the NVIDIA tool to retrieve the latest earnings and provide a summary; please provide your query. So I'm just going to say "summarize it all" and we'll see if it does the same thing again. It's doing the same thing again: I'm an AI agent specializing in NVIDIA financial data, I'll use the tool, I will focus on accuracy and avoid speculation, providing only factual information. I'm not sure what to do here; we gave it two chances, so I'm not going to give it a third; we're just going to submit what it actually said to us. Anthropic got a 10 out of 10 here, OpenAI got an 8 out of 10, and Gemini got a 3 out of 10. Come at me in the comments if you will; maybe this experiment isn't perfect, but I think it's interesting. I updated the scoreboard and put an asterisk here, because if this had been a six or a seven, Gemini wouldn't be so far behind, so maybe we'll look at that again at the end. For now let's keep moving on with number six.

For handling conflicting information, here's what I'm going to do: as you can see, the report shows total current assets at 53,729 here, or 44,345 over here, so in n8n I'm going to say: can you give me a breakdown of the total current assets, and why is it 88,500? We'll send this off, and it should come back and say that's actually not what it is, and here's the real information. The response: I apologize for the confusion, the total current assets as of April are about $53.7 billion, not $88.5 billion, and I do not have the ability to explain why. Okay, now we're going to send this one off to OpenAI. It breaks down the total current assets, which it says are 49; I don't know where it actually got that 49 from, but at least it corrects us and says the 88 isn't actually in there, so we'll see. Finally, let's send this one off to Anthropic. Maybe this is because we put a PDF into Pinecone, and the way we split it and the way it got vectorized wasn't ideal, since it turned what was originally a table-and-chart layout into a wall of text, but anyway, Anthropic gives us a pretty good response back. It also highlights that it didn't find the 88,500 figure, so maybe we got that from somewhere else. Let's see what the grades look like for this test.
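On that chunking point: when a table-heavy PDF gets split into fixed-size chunks before embedding, balance-sheet rows can easily end up sliced across chunks, which hurts retrieval on questions like this one. A naive splitter, roughly like the defaults many vector-store loaders use, looks something like the sketch below (this is just an illustration, not what Pinecone or n8n actually runs):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with overlap, applied before embedding.
    Table rows that straddle a chunk boundary lose their context, which is
    one reason a table-aware or page-aware splitter usually retrieves better."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```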
Okay, so number one was from Google, and it got a 5 out of 10. Number two was from OpenAI, and it got a 7 out of 10; the grader also points out that it misstated the total current assets as about $49 billion instead of the correct $53.7 billion. And then Anthropic gets a nine. I was curious why Google got a worse score than OpenAI: it's because Google doesn't provide a breakdown or explain where the discrepancy might stem from. If I were grading this myself I would have given Google a better score than OpenAI, but I want to keep it consistent. Maybe it's because I'm using an OpenAI tool to grade OpenAI's responses, even though it doesn't know which one is which, but I don't know.

Finally, we've got source attribution. What I'm going to do is say: please tell me exactly where in the document to find stock-based compensation. Hopefully it's able to search through the different tables and pages, find the stock-based compensation, maybe even give us a breakdown of it, and then tell us which section to go to if we want to look for ourselves and get more information. Right now OpenAI is doing the same thing: stock-based compensation details can be found in section B, which covers expenses, and so on. Finally I'm sending it off to Google Gemini Flash; I've been very impressed by the speed of Flash (probably why it got its name), but let's see what we get back. For this one the first response was Anthropic, and it got a 9 out of 10: it provided a good, detailed guide on where to find it. OpenAI got a 6 out of 10, and Flash got an 8 out of 10. Again, I know I'm not going super in depth on each response's strengths and weaknesses; I just thought this would be a good way to grade, but feel free to pause the video if you want to read more about the responses and why each got the score it did.

Okay, so here are the end results: Claude 3.5 in first place with an 8.6, and GPT-4o in second place with a 7.7.
Gemini Flash 2.0 got a 6.9. I thought this was super interesting. The green is the highest score per test, and we had three perfect 10s: Claude 3.5 for response coherence and completeness, Gemini for speed, and Claude 3.5 for context window management.
In general, what I've seen is that I like to use GPT-4o a lot for agentic functions like tool calling and that reasoning aspect, and Claude for anything that has to do with creating information. In this case Claude is creating the queries to send to the vector databases, and then also taking that augmented response from the vector database and turning it into content that's more human-like and readable. If you saw my video about creating the newsletter assistant, or the team of research assistants to help with creating newsletters, I was using Claude 3.