You want to fine-tune your own large language model and run it locally on your machine using Ollama. Well, in today's video, we're going to do exactly that. So let's go.
First, for the fun part: finding the right dataset. The reason this is so important is that when you train a small large language model on a dataset that is relevant to the task you're trying to do, it can actually outperform much larger models.
What I'm going to be doing today is creating a small, fast LLM that will generate SQL based on table data I provide. One of the biggest datasets for this is called Synthetic Text-to-SQL, which has over 105,000 records split into columns for the prompt, SQL, context, complexity, and more.
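If you want to take a quick look at the dataset yourself, here's a minimal sketch using the Hugging Face datasets library; the dataset ID gretelai/synthetic_text_to_sql is my assumption based on the dataset card:

```python
# Load the Synthetic Text-to-SQL dataset and inspect its columns.
from datasets import load_dataset

dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")
print(dataset.column_names)  # includes sql_prompt, sql_context, sql, sql_explanation, ...
print(dataset[0])            # peek at one record
```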
I'm running an NVIDIA 4090 GPU, so I'm going to be fine-tuning this on my machine using Ubuntu. If you don't have a GPU, feel free to do this using Google Colab, which lets you run training code in the cloud. The great news is that this project does not require a lot of complex hardware to get it up and running.
We're going to be using Unsloth, which lets you fine-tune a lot of open-source models really efficiently, with about 80% less memory usage. And we're going to be using Llama 3.1, an LLM intended for commercial and research use, especially in English, with really high performance.
Make sure that you have Anaconda installed on your machine, as well as the CUDA libraries. I will be using CUDA 12.1 and Python 3.10 for this project. You'll want to install the dependencies required by Unsloth, which you can find in its README. But for simplicity, here they are.
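Note that the Unsloth README changes over time, so double-check it, but at the time of recording the conda setup looked roughly like this:

```bash
# Create a fresh environment with Python 3.10 and the CUDA 12.1 PyTorch build,
# then install Unsloth from GitHub plus its training dependencies.
conda create --name unsloth_env python=3.10 \
    pytorch-cuda=12.1 pytorch cudatoolkit xformers \
    -c pytorch -c nvidia -c xformers -y
conda activate unsloth_env

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```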
This creates a new environment for us and installs the PyTorch CUDA libraries as well as the latest Unsloth. You'll also want to install Jupyter if it isn't there already, and then run your Jupyter notebook. And now you're done with the setup.
So let's go into the Jupyter notebook and get started. First, we want to make sure that all the required packages are actually installed. If you're using Google Colab, this command will install them.
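For Colab, that install cell is roughly the following; again, the exact pip spec comes from Unsloth's notebooks at the time, so check the README for the current one:

```python
# Colab only: install Unsloth and the training dependencies into the runtime.
%pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
%pip install --no-deps trl peft accelerate bitsandbytes
```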
Next, we're going to import the FastLanguageModel class from Unsloth. Here we're specifying that we want to use the Llama 3.1 8B model. We also want to set a max sequence length of 2048 tokens. This means that the model will only consider up to 2048 tokens when processing or generating text, where a token can be a word, subword, character, or even punctuation. We'll also set load_in_4bit to True, which essentially means we're using fewer bits, as opposed to the typical 16 or 32 bits, to represent the information in the model. Doing this helps you reduce memory usage and also reduces the load on your machine.
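Putting those settings together, the load step looks roughly like this; the exact model name is my pick, and any of Unsloth's pre-quantized Llama variants would work:

```python
from unsloth import FastLanguageModel

max_seq_length = 2048  # the model considers up to 2048 tokens at a time

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # pre-quantized 4-bit Llama 3.1 8B
    max_seq_length=max_seq_length,
    dtype=None,         # None = auto-detect (float16 or bfloat16)
    load_in_4bit=True,  # 4-bit quantization to cut memory usage
)
```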
After running this, you're going to get a cute ASCII image, and that means that your model is loaded. After this, we're going to load the PEFT model, which basically adds LoRA adapters. If you don't know what these terms mean, that's totally fine.
Basically, the LoRA adapters mean that we only have to update 1 to 10% of the parameters in the model. Without them, we would have to retrain the whole model, not just a small portion, which takes a lot of time, energy, and even money. Unsloth provides this here with the recommended settings.
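Those recommended settings, as they appear in Unsloth's example notebooks, look like this:

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: higher trains more parameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,    # 0 is optimized in Unsloth
    bias="none",       # "none" is optimized in Unsloth
    use_gradient_checkpointing="unsloth",  # saves memory on long contexts
    random_state=3407,
)
```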
I trust them, so feel free to read each comment. Now, this is where things can get a little bit tricky depending on which dataset you're using. Each dataset is different from the others, but they all need to be formatted in the same way so that the large language model can understand them.
For training, we'll use Alpaca-style prompts. Now, if you remember our dataset, it is not as easy as just plugging it in and letting it go off to the races; I have to format my data to match the Alpaca template first.
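The template itself is the standard Alpaca prompt, as used in Unsloth's example notebooks:

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
```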
From the dataset, I'm only interested in the SQL context of my database (the table definitions), the prompt I will be asking, as well as the generated SQL and its explanation. So I'm going to update my code to reflect this.
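Here's a sketch of that formatting step, assuming the column names from the dataset card (sql_prompt, sql_context, sql, sql_explanation):

```python
EOS_TOKEN = tokenizer.eos_token  # must be appended, or generation never learns to stop

def formatting_prompts_func(examples):
    texts = []
    for prompt, context, sql, explanation in zip(
            examples["sql_prompt"], examples["sql_context"],
            examples["sql"], examples["sql_explanation"]):
        text = alpaca_prompt.format(
            prompt,                      # instruction: the user's question
            context,                     # input: the table definitions
            sql + "\n\n" + explanation,  # response: the query plus its explanation
        ) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
```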
Now we set up the training using the SFTTrainer (supervised fine-tuning trainer) from Hugging Face's TRL library. There are a lot of parameters to use, all of which could be described in their own video.
So, for example, max_steps tells us how many training steps to perform, seed fixes the random number generator so we can reproduce results, and warmup_steps gradually ramps the learning rate up at the start of training.
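Put together, the trainer setup looks roughly like this; the hyperparameters mirror Unsloth's example notebooks rather than anything tuned specifically for this dataset:

```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",   # the column produced by our formatting function
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,    # ramp the learning rate up gently at the start
        max_steps=60,      # how many training steps to perform
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        seed=3407,         # fixed seed so results are reproducible
        output_dir="outputs",
    ),
)
```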
So now that we have everything set up, let's run it by calling trainer.train(). And that's it. Your model has been trained.
Now, before we move on, we actually need to convert this into the right file type so that we can run it locally using Ollama. Luckily, Unsloth has a one-liner we can use to do this.
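That one-liner is Unsloth's GGUF export. The quantization method is my pick here; q4_k_m is a common size/quality tradeoff:

```python
# Saves a llama.cpp-compatible GGUF file into the "model" directory.
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
```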
After this is done, we only need to do one more step to be able to run it with Ollama. First, open up your terminal. I'm using the Warp terminal here. Go to the path where the exported file is saved.
Then create a file called Modelfile and open it up in your code editor. This is Ollama's Docker-like file configuration, where we can create new models with specific parameters. In our Modelfile, we're going to put a system prompt.
So, something like: you are an SQL generator that takes a user's query and gives them helpful SQL to use. Finally, make sure Ollama is running, and then we're just going to run this command.
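Our Modelfile ends up looking something like this; the FROM path is whatever filename Unsloth produced during the export step, so mine is an assumption:

```
FROM ./model/unsloth.Q4_K_M.gguf

SYSTEM """You are an SQL generator that takes a user's query and gives them helpful SQL to use."""
```

Then build the model; the name sql-generator is just my choice:

```bash
ollama create sql-generator -f Modelfile
```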
This command will then read all the items in the Modelfile you just created and use llama.cpp under the hood to make sure the model runs on your machine. And congrats! You can now use your fine-tuned LLM locally, with an OpenAI-compatible API and more, in your applications.
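For example, here's a sketch of calling the model through Ollama's OpenAI-compatible endpoint; sql-generator is the hypothetical name we gave it above:

```python
# Ollama serves an OpenAI-compatible API at localhost:11434/v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is a placeholder

response = client.chat.completions.create(
    model="sql-generator",
    messages=[{"role": "user", "content":
               "List the names of all customers who placed an order in 2024."}],
)
print(response.choices[0].message.content)
```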
If you're curious to know more about Ollama, we have a two-minute video about everything you need to know about Ollama here. Otherwise, thank you for watching, and I'll see you next time.