Optimize Your AI Models

Matt Williams
Dive deep into the world of Large Language Model (LLM) parameters with this comprehensive tutorial. ...
Video Transcript:
Hey there. There are a lot of parameters available when working with LLMs, like temperature and seed and num_ctx and more, but do you know how to use them all and what they all mean? Although I tend to focus on Ollama, the contents of this video apply just as much to any other tool, except for the parts where I talk about the actual implementation. You can find a list of the parameters in the Ollama documentation: just go to the Modelfile docs and then Parameters. The first three entries there cover Mirostat sampling, which is a strange place to start, since it's probably not the most common one you'll use. So instead, let's start our list with what are probably the most common ones.

I think the first one on our list should be temperature. The way models work is that they guess the first word of the answer, then try to figure out the most likely next word of the answer, and they just keep repeating that over and over again. When coming up with the most likely next word, or actually token, the model creates a list of words or tokens that could potentially go into that next slot, and each one has a probability assigned to it showing how probable that option is as the next token. But the model doesn't actually store probabilities; it works with what are called logistic units, or logits. Logits are unscaled when they are first generated, but they tend to be, especially with llama.cpp, which Ollama uses, between minus 10 and 10. These logits are converted with a softmax function into a series of numbers between 0 and 1, and if you add up all the numbers, they add up to one, so essentially they become probabilities.

Temperature helps scale the logits before they become probabilities. Lower temperatures will spread the logits out, making the smaller numbers smaller still and the larger numbers even larger. This means that what was the most probable token before becomes even more probable now. But a temperature greater than one reduces the difference between the logits, resulting in probabilities that are closer together, so tokens that had a lower probability before now have a higher chance of being chosen than they did before. The result is that the model feels a little more creative in the way it answers a question.
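To make that concrete, here is a small worked example with three made-up logits for the candidate tokens; the numbers are purely illustrative and not taken from any real model. The logits are divided by the temperature before the softmax is applied:

    logits [2, 1, 0], temperature 1.0 -> unchanged  [2, 1, 0]   -> probabilities [0.665, 0.245, 0.090]
    logits [2, 1, 0], temperature 0.5 -> scaled to  [4, 2, 0]   -> probabilities [0.867, 0.117, 0.016]
    logits [2, 1, 0], temperature 2.0 -> scaled to  [1, 0.5, 0] -> probabilities [0.506, 0.307, 0.186]

Below a temperature of one the top token pulls even further ahead, and above one the gap shrinks, which is why higher temperatures feel more random and more creative.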
Next on our list is num_ctx. This sets the context size for the model. When you look around at different models, or when you see the announcement for a brand new model, you might get excited when it says a context size of 128K tokens, or 8K tokens, or a million tokens. But then you start to have a long conversation with, let's say, Llama 3.1 and wonder why it's forgetting information that was actually pretty recent. In Ollama, every model starts out with a 2K context size. That means 2048 tokens are in its context, and anything older may get forgotten. The reason Ollama does this is that supporting more tokens in the context requires more memory, and a 128K-token context is going to require a lot of memory. From what we've seen in the Ollama Discord, a lot of people are starting out with GPUs with only 8 GB of memory, and some even smaller than that, which means they just can't possibly support a 128K-token context, or even 8K tokens. For that reason, all Ollama models start out with a default context of 2K, or 2048, tokens.

So if you're excited to use Llama 3.1 for its 128K context size, grab Llama 3.1 and then create a new Modelfile with a FROM line pointing to llama3.1 and a single parameter, num_ctx, with a value of 131072, which is 128K. Then run ollama create mybiggerllama3.1, or whatever you want to call it, with -f pointing to the Modelfile. This will create a brand new model that has a max context size of 128K tokens. Now you can run ollama run mybiggerllama3.1 and you'll be in your new model.
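Putting that together, the Modelfile and the two commands might look something like this; mybiggerllama3.1 is just an example name, and the Modelfile is assumed to be saved as ./Modelfile:

    # Modelfile contents
    FROM llama3.1
    PARAMETER num_ctx 131072

    # then, from the shell
    ollama create mybiggerllama3.1 -f ./Modelfile
    ollama run mybiggerllama3.1

Keep in mind that an actual 128K context needs a lot of memory, so on a smaller GPU you may want a more modest num_ctx value.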
But let's say you're playing around with a new model and can't find the max supported context size. The easiest way to figure this out is to run ollama show llama3.1. Near the top we see the context length, which is the max supported length of the model. The fact that we don't see a parameter of num_ctx defined tells us that it is set to use the Ollama default of 2048 tokens. Yeah, that can be a little confusing. But let's say you want to use Orca 2 with a context length of 10,000 tokens: if you tell it to summarize something that is much longer than that, you'll probably not get anything useful, because the model doesn't know how to handle it.

So what do you think of this video so far? Click the like button if you find it interesting, and be sure to subscribe to see more videos like this in the future. I post another video in my free Ollama course every Tuesday and a more in-depth video like this one every Thursday. Subscribing means you won't miss out on any future videos.

Now let's move on to the next thing in the list: stop words and phrases. Sometimes you'll ask a model to generate something and see it start to repeat itself, often with one strange word or symbol at the beginning of each repeat. You can tell the model to stop outputting text when it sees that symbol. All the other parameters accept only a single value, but stop allows multiple stop words to be used.

There are two other parameters that deal with repeats: repeat_penalty and repeat_last_n. We talked before about how a list of potential tokens is generated along with the probability that each is the most likely next token. repeat_penalty will adjust that probability if the token, or word or phrase, has been used recently. If the logit for the token is negative, the logit is multiplied by the penalty; if the logit is positive, it is divided by the penalty. The penalty is usually greater than one, resulting in that token being used less, but it is also possible to set it below one, meaning the token will be used more often. At this point you're probably wondering how large the window is for finding repeats. Well, that's what repeat_last_n is for: it defines the window. The default is 64, meaning it looks at the last 64 tokens, but you can set it to a larger or smaller window. If you set it to zero, it disables the window, and if you set it to minus one, the window is the full context of the model.

top_k determines the length of the list of candidate next tokens. It defaults to 40, but it can be anything you like. This one is pretty simple: only the 40 most likely tokens will be in the list. top_p is a little more complicated. When you add up all the probabilities in the list, you should end up with one. When using top_p, the list is cut down to the tokens whose probabilities add up to top_p, so a top_p of 0.95 will exclude the tokens whose probabilities would sum to the remaining 0.05.
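If you want to pin any of these for a model, they can go in a Modelfile just like num_ctx did. Here is a sketch with illustrative values only, not recommendations from the video; note that stop is the one parameter that can appear more than once:

    PARAMETER stop "###"
    PARAMETER stop "User:"
    PARAMETER repeat_penalty 1.2
    PARAMETER repeat_last_n 128
    PARAMETER top_k 40
    PARAMETER top_p 0.95

You can also set the same values for a single session from inside ollama run with the /set parameter command.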
min_p is an alternative to using top_p. It looks at the candidates' probabilities: it takes the probability of the most likely token in the list, figures out what min_p percent of that value is, and every candidate token must have a probability greater than that minimum to be considered. So if the most likely token had a probability of 0.8 and min_p is 0.25, all candidates must have a probability greater than 0.2 to be considered, since 25% of 0.8 is 0.2.

Okay, tail-free sampling, or tfs_z. If you create a chart of all the probabilities, you'll see it slowly approaching zero. Tail-free sampling cuts off that tail at some point; as the value approaches zero, more of the tail gets cut off. If you use this, you want to start really close to one and gradually come down. A value of one means none of the tail is cut off. 0.99 to 0.