This is Mixtral 8x7B, a powerful model loved by millions of users and developed by a team of experienced researchers at Mistral. Though very expensive to create, it has now proven its worth through top scores on all key benchmarks. On the other hand, this is Ramonda, with 7 billion parameters, loved by five people exactly, and my humble creation. It cost around a few dollars. In spite of this humble funding, Ramonda's scores surpass Mixtral's on all Open LLM Leaderboard benchmarks except for one. So how did I pull this off, and is it easy to replicate? Well, let me start by saying that anyone can pull this off through model blending, a new and experimental technique. As a non-expert in machine learning, I was able to blend around 20 models in the last two weeks. Each blending process took approximately 15 to 30 minutes, with the majority of that time spent waiting for files to download. Many of my blended models perform well on the Open LLM Leaderboard, some even outperforming popular models. However, I made numerous mistakes and wasted time on unsuccessful methods, so I'm creating this video to help you
become more efficient in model blending. I read a lot about model blending to prepare for this video, and the more I read, the more fascinated and excited I became. To be specific, I grouped all of my insights into three important areas that will help you understand what blending is and why you should care about it. Point number one: what is the promise of blending? Let's begin with a question: why don't you train your own state-of-the-art, top-performing LLM from scratch right now? If you blame a lack of time, good hardware, or skills, but especially money, you'd be correct. To create a truly exceptional LLM you need a significant amount of resources; for example, Sam Altman admitted to spending over $100 million on training the state-of-the-art GPT-4. However, this doesn't mean that you can't create your own top-performing LLM, at least for your own specific use case. By fine-tuning a model, you can make it behave, write, speak, or even draw in the way that you find most helpful. Imagine you have fine-tuned a Llama model three times: the first model is fine-tuned to write social media posts that align with your brand, the second model writes excellent, polished Python code, and the third model extracts information in a structured way. Instead of using them separately each time, you could merge them into one model that performs really well on all of these tasks. Your model might even score highly on the Open LLM Leaderboard, a chart that's open to model submissions from anyone. In fact, many merged models currently occupy the top 20 positions, and the best-performing model that I've blended ranked third among all 7-billion-parameter models on the entire leaderboard. Sadly, what seemed to be days of glory turned out to be just one day of glory, as the model fell to fifth place the following day. Point two: how to do it. As a little preview, in order to blend a model I'll guide you through three simple steps. Step one is the installation of mergekit, a Python toolkit that enables merging. Step two is specifying which models and parameters you're using by writing them down in plain English. And lastly, you'll merge the models by running one command in a terminal. I wouldn't say that any
of these steps requires programming, but you do need some basic knowledge of how the terminal works. When it comes to hardware, I was able to merge three models with 7 billion parameters into one on my Mac with 16 GB of RAM, but I couldn't do anything more ambitious than that, so I'm going to rent a GPU from Massed Compute, and shout out to their amazing team for the support. If you're also limited by your hardware like me, you might consider renting Massed Compute's virtual machine. There's already a Maya Akim image with the entire mergekit library installed that allows you to start merging right away, and once you merge a model you can immediately load it via the pre-installed text-generation-webui to see how smart your model is. You would pay only 62 cents per hour, but with my link you'll get a 50% discount every time you rent, and in return I'll get a small percentage that helps support me and my channel. But if you already have a good machine and you're not interested in renting, then you can install mergekit by doing the following: open VS Code, open a new terminal, type "git clone" and paste the mergekit repository URL. Then move into the folder you just downloaded by typing "cd mergekit", and lastly type "pip install -e ." and this will install everything you need. To proceed, there are two important questions you need to ask yourself when blending models: which models should you choose from the thousands of models available on Hugging Face's Hub, and which merge method is best for your idea? Let's start by discussing merge methods and look at the most important ones. First, task arithmetic: in this research, published a year ago, researchers focused
on task vectors. These vectors represent the differences between the weights of a pre-trained model and the weights of the same model after fine-tuning. As a reminder, weights are numbers stored in a model's architecture that determine how much attention the model gives to certain words or phrases when it's trying to understand language. Task-vector arithmetic lets you manipulate these vectors using basic arithmetic operations such as negation or addition. Let's start with negation, or forgetting. Suppose model one was fine-tuned on a dataset with negative sentiments about cats, resulting in a task vector that associates negative attributes with cats, such as "cats are evil". In contrast, model two could have been fine-tuned on a dataset that portrays cats positively, leading to a task vector with positive statements like "cats are adorable". By adding the positive task vector from model two to the negative vector from model one, the two vectors would balance each other out, reducing the negativity towards cats. Now let's discuss addition. Again, consider two language models: model one fine-tuned to generate factual descriptions of cats, and model two fine-tuned to generate poetic descriptions of animals. By combining these models using task-vector arithmetic, the resulting model would generate descriptions of cats that are both accurate and poetic. One of the best things about task-vector arithmetic is that it allows blending more than two models.
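The task-vector idea is easy to sketch in code. Here is a tiny NumPy illustration, with short arrays standing in for full weight tensors; the numbers and the "model one / model two" roles are made up for illustration:

```python
import numpy as np

# Toy stand-ins for one weight tensor from each checkpoint.
# Real task vectors span every parameter of the model.
base = np.array([0.10, -0.20, 0.30])       # pre-trained weights
model_1 = np.array([0.25, -0.10, 0.05])    # fine-tune 1 (e.g. factual cats)
model_2 = np.array([0.05, -0.45, 0.60])    # fine-tune 2 (e.g. poetic animals)

# A task vector is the difference between fine-tuned and base weights.
tau_1 = model_1 - base
tau_2 = model_2 - base

# Addition: fold both skills into one set of weights.
merged = base + tau_1 + tau_2

# Negation: subtract a task vector to "forget" a behavior.
forgot_1 = base - tau_1
```

The same pattern extends to any number of fine-tunes, which is why task arithmetic can blend more than two models.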
Another method for blending models is SLERP, which stands for spherical linear interpolation. Despite its intimidating name, the application is quite simple. This method finds a middle ground between models trained to have different opinions on a topic, let's say Crocs: model one might have a strong bias towards the comfort of Crocs, while model two might have a milder bias towards their style. SLERP first ensures that both opinions have equal importance, then measures the difference between these views, and finally finds a common opinion that takes a bit from both sides. However, this method allows the merging of two models only. Next, TIES and DARE: these are very similar methods with slight differences, and they allow the merging of multiple models. When a model undergoes fine-tuning, not all parameters change equally, so the TIES method focuses on identifying the parameters with the most significant changes. If two or more models suggest opposing adjustments to the same parameter, it resolves the conflict by creating a vector that represents the dominant direction. The DARE method additionally prunes, or resets, certain parameters of the fine-tuned model, rescales the weights, and introduces a little bit of randomness. And lastly, there is the passthrough method: you can concatenate, or chain together, layers of different models to create a new model with an unusual number of parameters. These are known as Frankenstein merges, or frankenmerges. I've covered the basics and tried to explain the methods in the least technical way possible, but I'll provide links to the research so that you can explore further.
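Of the methods above, SLERP is the easiest to sketch numerically. Here is a minimal NumPy version of spherical linear interpolation between two vectors; a real merge applies this per weight tensor, and the "comfort"/"style" labels are just illustrative:

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between vectors v0 and v1."""
    # Work out the angle between the two directions.
    v0_n = v0 / (np.linalg.norm(v0) + eps)
    v1_n = v1 / (np.linalg.norm(v1) + eps)
    dot = np.clip(np.dot(v0_n, v1_n), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < eps:  # nearly parallel: plain linear interpolation is fine
        return (1 - t) * v0 + t * v1
    # Weight each endpoint so the result moves along the arc, not the chord.
    s = np.sin(theta)
    return (np.sin((1 - t) * theta) / s) * v0 + (np.sin(t * theta) / s) * v1

a = np.array([1.0, 0.0])  # model one's "comfort" direction
b = np.array([0.0, 1.0])  # model two's "style" direction
mid = slerp(0.5, a, b)    # halfway along the arc between the two
```

At t = 0 you get back the first model, at t = 1 the second, and values in between take a bit from both sides.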
Now let's blend a model. The best way to begin is by selecting and downloading the models you want to blend from Hugging Face. So how do you choose the right models? Firstly, it's important to know that mixing models with different architectures won't work, which narrows down the options. Although mergekit's GitHub page suggests that you can merge models of different architectures, users have reported issues, and the creator himself has confirmed that this feature is still experimental and not really usable yet. In my own experiments, I found that blending Llama and Mistral models resulted in a model that produces gibberish output. Therefore, it's necessary to choose and stick to different fine-tuned versions of the same model. Additionally, the models should have the same number of layers in order to avoid specific errors. To make sure that you're working with models from the same family, you can use a helpful Google Colab notebook created by Maxime Labonne that displays the entire family tree of a model. The approach that worked best for me was to check the Open LLM Leaderboard and try blending well-performing models together. To download a saved model from Hugging
Face, type the command "git lfs clone" in the terminal and paste the URL of the model. The next step is to create a YAML file. Legend has it that YAML originally stood for "yet another markup language"; it's designed to be easily read and written by both humans and computers, and it's commonly used for configuration files. The content of the file will depend on the merging method you're using. Let's begin by creating a new file with the .yaml extension, and let's start with task arithmetic. You will need to specify the
base model and the path to where you saved it, as well as the merge method and a dtype, which specifies the data format used for merging. You should also specify which models you're going to use for the merge, along with the parameter "weight", which defines the influence each model will have on the overall blend. In this case, the second model will contribute more than the first, and the total weight of all models should add up to one.
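As a sketch, a task-arithmetic config along these lines might look like the following; the model paths are placeholders, and the field names follow mergekit's config format as I understand it:

```yaml
models:
  - model: ./models/model-1        # placeholder path
    parameters:
      weight: 0.4
  - model: ./models/model-2        # placeholder path
    parameters:
      weight: 0.6                  # contributes more than model-1
base_model: ./models/base-model
merge_method: task_arithmetic
dtype: bfloat16
```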
For SLERP, you need to define the merge method, base model, dtype, and slices. Unlike the previous example, where the parameter "models" defined entire models to be used for merging, slices define parts of the layers of different models to be used. Another important parameter is t, the interpolation factor. If t is set to 0, the base model will be returned as the result of the merge, which is pretty pointless; setting t to 1 would return the other model, again pointless. So instead you can write a gradient that merges the layers in a smooth, gradual way. Also, one of the two models that you're defining in slices should be the base model.
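A SLERP config with a gradient for t might look roughly like this; the paths are placeholders, and the 32-layer range is an assumption for a typical 7B model:

```yaml
slices:
  - sources:
      - model: ./models/base-model   # one source must be the base model
        layer_range: [0, 32]
      - model: ./models/model-2
        layer_range: [0, 32]
merge_method: slerp
base_model: ./models/base-model
parameters:
  t: [0.0, 0.25, 0.5, 0.75, 1.0]   # gradient: early layers favor the base,
                                   # later layers favor model-2
dtype: bfloat16
```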
TIES and DARE-TIES configs are similar to task arithmetic, but they also introduce another parameter called "density", which is intended to optimize the merging process. Density defines the fraction of the most significant parameters of each model that should be retained in the merged model, and it looks like setting this parameter to higher values produces better merged models, at least according to some GitHub conversations. You also don't have to worry if the density parameters don't add up to one, or add up to more than one in total.
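A TIES config might look like this sketch; the paths are placeholders, and the density and weight values are only illustrative:

```yaml
models:
  - model: ./models/model-1
    parameters:
      density: 0.6    # keep the most significant 60% of parameters
      weight: 0.5
  - model: ./models/model-2
    parameters:
      density: 0.6
      weight: 0.5
merge_method: ties    # use dare_ties here for the DARE variant
base_model: ./models/base-model
dtype: bfloat16
```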
Finally, the passthrough configuration is similar to SLERP's: in addition to defining the standard merge method and dtype, you specify which slices of which models will contribute to the final blend. One thing to keep in mind: the layer ranges should overlap. If you just stack all the layers of one model on top of all the layers of the other model, you'll probably get a model that's not very coherent. Trust me, I tried this, and the results can get very, very weird.
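A passthrough (frankenmerge) config with overlapping layer ranges might look like this sketch; the paths and ranges are illustrative:

```yaml
slices:
  - sources:
      - model: ./models/model-1
        layer_range: [0, 24]
  - sources:
      - model: ./models/model-2
        layer_range: [8, 32]   # overlaps layers 8-23 of model-1
merge_method: passthrough
dtype: bfloat16
```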
Okay, let's merge the model. To do so, you run the command "mergekit-yaml" in the terminal, followed by the name of the YAML file, which in my case is config.yaml, but it's okay if your file has a different name. You should also specify the name of the folder where you want your merged model to be saved; call it "merged-model" or whatever you want. Then there are a few flags: --allow-crimes lets you mix models with different architectures, which doesn't really work yet; --copy-tokenizer copies the tokenizer from the base model; --out-shard-size 1B chunks the model into smaller pieces so that it can be loaded on a CPU with smaller RAM; and --lazy-unpickle is experimental and lowers memory usage. To automatically create a little README file that contains all the information about the merge, just add --write-model-card.
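Putting the command and the flags together, a full invocation might look like this; "config.yaml" and the output folder name are just examples:

```shell
mergekit-yaml config.yaml ./merged-model \
  --allow-crimes \
  --copy-tokenizer \
  --out-shard-size 1B \
  --lazy-unpickle \
  --write-model-card
```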
Now you can let the merge begin; the process typically takes only a few minutes. Afterwards, I like to load the model in text-generation-webui and run inference to see if everything is okay. If you're satisfied with the model, you can upload it to Hugging Face and submit it to the Open LLM Leaderboard. This will let you see how well your model scores on all the benchmarks, and it will also enable others to discover and use your model. To do that, install the Hugging Face Hub package by typing "pip install huggingface_hub". Next, run "huggingface-cli login", which will let you paste your Hugging Face token; just make sure it's a read-and-write token. Finally, type "huggingface-cli upload", then your Hugging Face username followed by whatever you decided to name your model, and the name of the folder where your merged model is saved.
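Those upload steps, as terminal commands; the username and model name here are placeholders:

```shell
pip install huggingface_hub
huggingface-cli login   # paste a token that has write access
huggingface-cli upload your-username/your-model-name ./merged-model
```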
Once you run this, the upload process begins. Now, you might be wondering which method is the best. I had the same question, and although I'm pretty sure the answer depends on many factors, I decided to conduct a small test: I merged two identical models five times, each time using a different method, to see which one would perform best on the Open LLM Leaderboard. I managed to evaluate only four models, because the fifth model, merged with the passthrough method, failed the evaluation. The worst-performing model was merged with TIES, followed by SLERP in third place and DARE in second place, and the best-performing model was merged with task arithmetic. However, all four models had more or less similar results. So let's move on to the final point: contamination. Two months ago, a very disappointed Reddit user created a post with a somewhat inflammatory title: "Open LLM Leaderboard is disgusting". So why would someone feel so strongly about a table? The leaderboard is supposed to rank the best-performing models based on several well-known benchmarks. However, these
benchmarks are often to blame for why we have at least one "this model just beat GPT-4" hype cycle per month, whenever a company releases a model with higher scores than GPT-4. So what exactly gets measured when you upload your model to the Open LLM Leaderboard? Firstly, there's the AI2 Reasoning Challenge, which consists of grade-school science questions. HellaSwag is a benchmark that checks common sense by making the model complete sentences in the most logical way, which is surprisingly hard for models, although it's easy for us humans. MMLU measures how diverse a model's knowledge is and consists of various tests across all kinds of subjects. TruthfulQA tests how truthful a model's answers are, because LLMs can accidentally pick up and learn false information from their training data, which is not surprising given that they're trained on the internet. WinoGrande checks common-sense reasoning skills, and GSM8K measures a model's ability to solve mathematical reasoning problems. Sometimes, some of the questions used for these benchmarks end up in a model's training dataset, which is called data contamination. If a model has been trained or fine-tuned on questions that are part of the
benchmark, it ends up scoring highly simply because it's optimized for those questions. However, this doesn't mean it's actually intelligent. Goodhart's law, named after the British economist Charles Goodhart, states that when a measure becomes a target, it ceases to be a good measure, and that definitely resonates with the Open LLM Leaderboard's problem. To get back to the Reddit post: the redditor listed several models that had been merged to create the state-of-the-art model at the time. All the models used for the merge had previously been fine-tuned on the same or similar datasets, which can lead to overfitting. Overfitting occurs when a model learns the details of its training data a little too well; as a result, the model fails to generalize and apply its knowledge to new, unseen data. Given that the models used for the merge probably contained contaminated data in their fine-tuning datasets, it becomes clear that merging them doesn't always improve a model; it just optimizes the merged model for the leaderboard, potentially bringing a day, or even a few days, of glory. And to get back to the merged Ramonda, the model from the beginning of the video: is it really better than Mixtral just because it scored highly on the leaderboard? Well, it's hard to tell, because both models sound smart, but if somebody told me that my model isn't better than Mixtral, it's just contaminated, I wouldn't be surprised. Ultimately, this opens the question of how helpful benchmarks actually are, and that's beyond the scope of this video. So, to merge models and top the leaderboard in an honorable way, simply try to make sure that there's no data contamination. You can do this by merging pre-trained models or by carefully selecting fine-tuned models. And that's it for today. I really hope you enjoyed this video. Now go make some really cool models, and have fun!