hey I'm Mark Hennings I'm one of the founders at entrypoint Ai and I'm looking forward to talking to you about prompt engineering retrieval augmented generation fine tuning how they're similar how they're different and how they can work together so let's start with prompt engineering even though we're all probably pretty familiar with this it's a good starting point for how we can add Rag and fine-tuning to some prompt engineering that we've already done a typical prompt has some kind of priming like you are a plumbing Q&A bot that answers questions about Plumbing in a helpful way
and then you might tell it to use you know what kind of language to use based on who your audience is um how to handle edge cases and errors you know to ignore things that are trying to hijack your prompt um don't answer questions about areas that aren't in your expertise right and then we have this user inquiry so this is a thing that's changing it's the dynamic content in our prompt where it's changing each time we're running our requests to the llm and then finally some kind of output formatting now it doesn't have to
be in this order and in this case the Json object is pretty basic it's not really helping us we could just have it export text um but that's a very common thing is to want to have your output in some kind of a consumable structured format all right now when we add retrieval augmented generation we're basically just adding another Dynamic piece of content to The Prompt it's this knowledge we want the model to use to answer the user inquiry so if the inquiry is how do I fix a leaky fuc it we're going to do
some stuff in the background we're going to take that inquiry and we're going to try to find the right information to answer that question and then insert it into the prompt so here we have some knowledge from the plumers handbook chapter one fixing common leaks um so if we put that in there then the model is going to have the information it needs to answer the question and this is really important because llms don't store facts they store probabilities large language models for example have been trained on ACH bunch of quotes said by people exact
quotes but they are never going to remember those quotes verbatim they're going to predict the most likely token and they're going to return those quotes in kind of a summarized or paraphrased way because that information has kind of been compressed into these probabilities of what word is going to follow so if we want a model to actually use very specific information then we need to provide it in the prompt and they are very good at dealing with information in the prompt and staying grounded staying true to that in their responses large language models are very
sensitive to the information in their prompts and so providing the right information there is very powerful it also allows us to expand the knowledge of the large language model by pulling in data from external sources which can be updated in real time too so here's how that process works in actually retrieving the information first we have to set it all up we need a database that has our information in it so first we're going to start with a corpor purpose of text this could be a bunch of web pages PDFs books you name it your
company's Internal Documentation or help center um some kind of information that you want the large language model to be able to work with then we're going to split it up this could be by paragraph bsection um you know you want to try to keep related text together so that if you're inserting it into the model then you're getting like a whole concept because you want useful information that the model can use to answer questions so there's a lot of ways to split your data but you need to split it up make a choice there and
then insert it into your database um along with that text in your database you're going to convert the text into an embedding which is this vector format it's a mathematical structure um that can actually you compare can compare one vector to another vector and judge their similarity and that's what's going to allow us to do the knowledge retrieval here in a minute we use an AI model to actually generate the embeddings and then we store the text and the embeddings in our database now when it comes time to calling the large language model we need
to retrieve some information we have this user inquiry like how do I fix a leaky faucet so we take that inquiry and we use the same model that we use to generate embeddings that are stored in our database to create an embedding from that inquiry and then we go to our database and we search using Vector search comparing the distance between two vectors to find the most similar um or relevant results then we take those results we put them into our prompt and we send it off for generation seems easy right well it gets a
little more complicated so the devil's in the details here and it really is easy to kind of create a demo that can work with a specific inquiry and then knowing that it'll pull up the right data for that inquiry but if you want consistent results for all kinds of different inquiries there's a lot of optimization steps that need to be added and I don't have an exhaustive list of that but I'm going to show you a few so you can get an idea the first one is to pre-process that inquiry into more of a topical
keyphrase that's more likely to match up with one of your chunks of data in the database so if the question was how do I fix a leaky faucet you could use an llm to summarize it and then your term would be fix leaky faucet which maybe remove some of the fluff or some of the extraneous parts that could be in the question then we're going to do exactly what we did before create an embedding from it search the database but before we just pass the results directly into the large language model for Generation we might
use an llm again to say which of these results is most applicable because if we're feeding the model bad information its answer is going to be off topic or wrong so if we can do an intermediate step where we ask the llm to pick the most applicable results first we can be more confident that our model is going to be using the right information then we build our prompt and we generate um but before we pass the generated content back to the user we could add another step and use an llm again to do self-reflection
and say is this a good and accurate answer if not rewrite it make it better and this just gives it one more opportunity to fix any errors in its response so as you can see a full-fledged rag process has a lot of moving parts now let's talk about fine-tuning so with fine-tuning you're actually training a foundation model on examples of your own prompt completion pairs prompt completion pairs are what you're giving the model in the prompt and then what is a good response back this is what I would want the model to give me back
given this input fine tuning is really useful when you're trying to teach intuition where words fall short let's imagine you're like a really good writer and you can just get in a flow State and write amazing content you've been doing this for decades but somebody asks you like what makes you a good writer like what are the 50 techniques you use to become a good writer that might be a hard question to answer and if I were to try to explain that in a prompt to a large language model of how it should do good
writing in my particular style I might struggle with that but if I can write then I can create prompt completion pairs that show how I write for a given topic or how I take a draft and I revise it into something really amazing and that's intuition and you can't teach intuition through a prompt by describing things with with rules and instructions but you can teach a model intuition by giving it examples and having it update its weights so this is really cool for baking in your style tone and formatting to your outputs this actually allows
you to remove a lot of the stuff out of the prompt because the model already understands it um that reduces your prompt length which in turn allows you to have longer completions and another cool way to use it is to train a smaller model to perform at the level of a higher model because making these models bigger and bigger is just not sustainable they get slower and more expensive the more parameters you add so even though they become more capable there's tradeoffs so what we need how we need to be thinking is selecting the right
model size for the task not just always using the biggest model fine tuning also Narrows the range of possible outputs from your model which is really helpful in preventing unwanted Behavior now unfortunately there's a lot of misconceptions floating around the internet about fine-tuning and how it works um so I want to get ahead of some of those right now people think that fine-tuning teaches a Model facts as we discussed models don't really store facts they store probabilities so while you might incidentally get back bits and pieces of your training data um it's not a guaranteed
functionality if you want the model to reference facts the best way to do that is to provide it in the context window in the prompt using uh technique like rag or any kind of knowledge retrieval it's also common to believe that fine tuning requires this really large data set and that is not true anymore with the foundation models that we have today we can get really cool results with just a handful like 20 examples another one is that it's too expensive which is just not true maybe it used to be true but now we have
parameter efficient fine-tuning techniques and if you can use just 20 to 100 examples and start to get meaningful results that we're talking pennies and dollars we're not talking thousands of dollars or millions of dollars another one is that it's too complicated which I totally get it's why we created Point AI so that you could deal with fine-tuning at a higher level and not worry about all the complexities of writing code or making API calls or the underlying Hardware you can just focus on your use case and the training data and then running that and getting
results and finally people say that it's incompatible with rag like you have to make this choice between rag or fine tuning and I'm going to show you exactly how they can work together so here's two fine tuning strategies you can keep in mind the first strategy is going all out on quality and in this scenario you take the largest possible Foundation model and you train it on your examples to get better output think of it as an extension to fuse shot learning so if you could provide a couple examples in the prompt but it starts
to get longer and longer just move those into a training data set find youe a model and now your model has been trained on your examples and it's going to be able to do a better job the second strategy is to optimize speed and cost in this scenario we pick a smaller model and we train it on an example data set to try to get it to perform at a higher level um as good as one of the large models would do with our engineered prompt it may require a larger data set especially depending on
how small of a model you want to pick an optional part of this is reducing your prompt size along the way so that you can um you know have a larger context window and save costs there too so I mentioned fine-tuning as an extension to fuse shot learning and here's an example of fuse shot learning in a prompt where you have two fuse shot examples which adds up to 48 tokens for every request here um and basically this is a sales lead qualifier so they're trying to decide if an inquiry from a uh marketing form
on the website is qualified or unqualified help I just had a pipe break in my house and there's water everywhere send someone ASAP this is probably a pretty good lead um they definitely need someone they're probably willing to pay for it um someone offering a small business package to new customers um it sounds like spam to me just some junk so unqualified now unfortunately let's say I have other scenarios that I want to cover in my training data for my particular company and like I think this lead's qualified and that lead isn't eventually our prompt
is just going to get longer and longer and it's going to get more expensive but with fine tuning we can take those examples out of the prompt and put them into a data set and with as few as 20 examples you can really like I mentioned earlier you can really start to see the Model Behavior change so here's what that looks like now our prompt we don't have those examples in it anymore but we have our training data with our examples and we're able to add as many more examples as we want and show the
model what is a qualifi lead and what isn't and as we go on and we find more and more edge cases we're just adding to our training data so it becomes this scalable layer backing our model um that gives us more Assurance in the type of output we're going to get now in terms of speed fine-tuning can make a huge difference even if you just jump down from GPT 4 to GPT 3.5 turbo you can see that the response times for 3.5 turbo are almost three times faster than GPT 4 for a lot of user
experiences that's going to make a really big difference smaller models are also much cheaper so making that same leap down and fine-tuning it we can save almost 90% in terms of our cost um and this adds up so much especially if you have a large volume of requests okay so now we understand Rag and we understand fine tuning and unfortunately fine tuning doesn't have a super cool acronym like rag um and there's a lot of different use cases for fine tuning so um when they're actually creating a model they do instruction tuning and they do
safety tuning um so for the type of fine tuning that I'm talking about where you're just trying to get better output for your Generations I think it would be pretty cool if we call that tuning augmented generation so then we'd have Rag and we'd have tag and we could put them together and we could have a rag tag team here's a fine-tune model prompt with Rag and and it has just the dynamic content everything else has been baked into the model the model knows what to do with the inquiry and the knowledge and it's going
to act like a Q&A bot whether we tell it to or not we've shown it what to do through our training data so let's review we have these three techniques prompt engineering it's awesome because it's easy to work with you can do rapid prototyping it's very intuitive to just write instructions and get what you want back rag is really powerful because it allows you to connect external data sources you can have Dynamic knowledge in the prompt that grounds it to your facts and it's real time so as you add more information to your database that
can be referenced by the large language model both prompt engineering and rag deal with the prompt so it's limited to your context window you just can't insert all of your knowledge base into the prompt you have to be really selective and get the right information fine tuning allows you to narrow the model's Behavior get more predictable outputs and bake in the style tone and formatting uh just like prompt engineering fine-tuning steers the behavior of the model and just like rag fine tuning allows you to apply data and domain knowledge and your model becomes more capable
because of it and the thing that they all have in common is that they all allow you to get better outputs and they can all work together and be tools and techniques that you have in your toolkit for working with large language models thank you so much for watching I hope this was a really helpful over rview of these three different techniques again I'm the founder at entrypoint Ai and we've created a fine-tuning platform to make fine-tuning a lot easier I'd also love for you to join our master class we host weekly on fine tuning
um it's a great way to get hands-on experience fine-tuning large language models and see how it works so that you can start applying it to solve problems with AI in your life and business