SakanaAI Unveils "Transformer Squared" - Test Time LEARNING

Matthew Berman
Video Transcript:
It seems this is the week of the Transformer successor. Sakana AI just dropped a brand new paper called Transformer Squared, and in it they detail a method in which a large language model can be updated at inference time based on the prompt it's given. And the best part: it's open source. They've already released the code, and it can be applied to any open-source model. Let me break it all down for you right now.

So this is the paper, "Transformer Squared: Self-Adaptive LLMs," by Sakana AI. If you don't remember, Sakana AI is a cutting-edge AI research company out of Japan. They were the ones that produced the paper and code for the AI Scientist, which I made an entire video about; it basically automates the ability for AI to perform open-ended scientific research. Very, very cool stuff.

I'm going to dive into the details of the paper, but the gist of how Transformer Squared works is that it takes two passes at a prompt. During the first pass, as we're seeing here with this nice little graphic, we take the user query, run it through the model, and try to understand what the query's task is. Is it a math question, as shown here? Is it a coding question? Is it a logic question? Then it actually updates its model weights at inference time to better answer the query at hand. We've seen a lot of this in the paper I just talked about out of Google, the Titans paper; they did a lot of the same thing, and it seems like that is the new trend: trying to make models less static. That is one of the big problems we've had with the Transformer architecture to date. Basically, once training is done, that model is fixed in time. It does not learn new information unless you give it external information, via RAG for example. But what if it could? What if you could take a model that has a lot of core intelligence and then allow it to learn and update its weights over time without a huge post-training cost? This could potentially be another scaling law, and that is what Transformer Squared does.

So let me break the paper down. Self-adaptive large language models: that is what Transformer Squared aims to be. It aims to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. Rather than fine-tuning, they propose a new method in which you almost surgically go into the model and change very specific weights at test time, at inference time, based on whatever the user's prompt is. Here is how they describe it: a self-adaptation framework that adapts LLMs for unseen tasks in real time by selectively adjusting only the singular components of their weight matrices. That's a fancy way of saying it updates its weights, but only the ones that are specific to the prompt at hand, and that is why the two-pass approach is necessary, the first pass being: let me understand what the prompt is actually asking for.

So let me explain what the two passes are. The first pass is a dispatch system that identifies the task properties. In the second pass, task-specific expert vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Simply put: understand the prompt, then update the weights to better answer the prompt.
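To make that flow concrete, here is a minimal sketch of what a two-pass inference loop could look like. All of the names here (classify_task, model.generate, model.adapted, expert_vectors) are hypothetical placeholders, not the API from Sakana AI's released code, and for simplicity this picks a single expert rather than mixing several, which the paper also supports. It only illustrates the "dispatch, then adapt, then answer" idea described above.

```python
# Hypothetical sketch of a two-pass, self-adaptive inference loop.
# None of these helpers come from the released code; they only illustrate
# the "dispatch, then adapt, then answer" idea.

def classify_task(model, prompt: str) -> str:
    """First pass: ask the unmodified model what kind of task this is."""
    dispatch_prompt = (
        "Analyze the given question and classify it into one of four "
        f"categories: code, math, reasoning, or others.\n\nQuestion: {prompt}"
    )
    return model.generate(dispatch_prompt).strip().lower()   # e.g. "math"

def two_pass_answer(model, expert_vectors: dict, prompt: str) -> str:
    # Pass 1: identify the task without changing any weights.
    task = classify_task(model, prompt)

    # Pick the pre-trained expert vector for that task (fall back to a default).
    z = expert_vectors.get(task, expert_vectors["others"])

    # Pass 2: temporarily adapt the weights with the expert vector,
    # then answer the original prompt with the adapted model.
    with model.adapted(z):            # hypothetical context manager
        return model.generate(prompt)
```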
The paper claims their method outperforms ubiquitous approaches such as LoRA with fewer parameters and greater efficiency, and I will show you the performance in a little bit. Now, again, this isn't the first time we've seen test-time training; in fact, there was literally a paper called "Test-Time Training" that I covered. So this is kind of the cutting-edge technique a lot of companies are using to update the model, not during pre-training and not during post-training, but at inference time, which is a really interesting approach to making models less static and more dynamic. The cool thing is it also works with vision models, and in fact really well with vision models.

All right, so let's talk about how model training works today, before Transformer Squared or any of the other papers we've talked about recently. LLM post-training has sought to optimize a model for a wide range of capabilities in a single extensive training session. While this one-shot fine-tuning framework is ideal from a simplicity perspective, it is also difficult to achieve in practice. For instance, post-training is still highly resource-intensive, and there tend to be notable performance trade-offs when introducing additional breadth to the data, making it challenging to overcome overfitting and task interference at the same time.

If you're not familiar: after a model is done with its pre-training, that initial run where you instill all of this knowledge into the model, you can do something called post-training, in which you guide the model on how to respond. You give it additional information; you're basically showing it how to follow instructions properly. So you take that raw knowledge and give it some guidance on how to use that knowledge. But that's very costly and very inefficient to do, and what they're proposing here is as follows: the self-adaptive model offers a more flexible and efficient approach. Rather than attempting to train an LLM for all tasks in one step, and here's the key, expert modules can be developed offline and augmented to the base LLM on demand. This allows the model to dynamically modify its behavior based on the task at hand without the need for constant retuning. It supports continual learning, enabling the model to add new skills over time without catastrophic forgetting.

Remember back to the Titans paper we just covered: traditionally, when you add a lot of new knowledge to a model, the model tends to forget other things, and that's the memory problem Titans was trying to solve. This approach is also trying to solve that issue. Self-adaptive LLMs mirror a well-established principle in neuroscience and computational biology. Let's pause for a second: it seems like all of the companies coming out with the next iteration of Transformers are trying to model their architectures much more closely on how the human brain works, and that is the basis for a neural network in the first place; that's why it was created originally. The brain activates specific regions depending on the task at hand and dynamically reconfigures its functional networks in response to changing task demands. So if you're doing some creative writing, if you're doing math, if you're writing code, these are different use cases, and your brain behaves differently based on whatever task you're trying to accomplish. Now they are mimicking that with Transformer Squared.
So again, think about it: you have the core model, and then you have these expert modules, an expert at coding, an expert at math, an expert at creative writing. At inference time, at test time, the framework tries to understand what the task at hand is and then, like a surgeon, goes in and changes the model weights using those expert modules. And it's very, very efficient, according to this paper. These expert modules are fine-tuned via techniques such as low-rank adaptation, which we've covered on this channel, and they can then be dynamically composed at runtime based on the task demands, a process that can be efficiently managed through mixture-of-experts-like systems.

But it's not that easy, so we're going to cover some of the challenges that need to be solved, which they apparently did using this method. First, fine-tuning LLMs to create multiple expert modules significantly increases the number of parameters that need to be trained; obviously, you have a bunch of additional sub-models, and all of those sub-models just take more and more parameters. Second, these expert modules are often prone to overfitting, something that is a problem in the world of AI in general, a phenomenon especially prevalent when training on smaller datasets or narrow task domains. So if you're looking to create really narrow expert modules, you are going to run into the overfitting problem. And third, the flexible composition of these expert modules presents largely unresolved challenges that are currently open research problems; basically, it's never been done before.

Here is their proposal to solve it: singular value fine-tuning (SVF), a novel parameter-efficient fine-tuning method (PEFT, which you've probably heard of if you watch this channel at all) to obtain effective building blocks for self-adaptation. SVF works by extracting and tuning only the singular values within the model's weight matrices. That's the important part: it is, again, kind of like a surgeon; it goes in and, in a very narrow way, changes only what is necessary to give the model that additional expertise.

Here's a nice graphic from the Sakana AI blog about this paper. On the left we have the LLM brain, and we can see that it's all basically one thing, but on the right, what they've done is segment the different parts of the brain, or the LLM, into language, reasoning, and coding, which allows it, at inference time, to know which parts of the brain, or the LLM, need to be updated. If we're looking at figure one, which is what they referenced in that previous paragraph, we can actually see what's going on. We have the user query, and the self-adaptation vectors are kind of like that brain diagram I just showed you. We're going to run the query once and figure out: well, is it math, is it coding, is it somewhere in between? Who knows, it could be anything. And then from that, we update the weights. Say this is a math question; then we come in for the second pass, in red (we can see the red arrow right here), we go in again after the weights have been updated, and then we actually get the answer.

And how does it perform? They show that SVF consistently outperforms traditional strategies for efficient fine-tuning, such as LoRA, and at the same time with orders of magnitude fewer parameters. What we're going to see in the actual results is that it performs better, not exponentially better, but it is orders of magnitude more efficient.

So how do they actually figure out which weights to update? They use two tools. First, SVD, singular value decomposition, which offers a fundamental view of matrix multiplications. Second, the cross-entropy method (CEM), a Monte Carlo method for importance sampling and optimization. Both of these are basically ways of figuring out what to change.

A little more detail about singular value fine-tuning: it is a key building block of Transformer Squared. It offers an extremely efficient parameterization for fine-tuning and provides inherent compositionality for adaptation; basically, it works well. Conventional fine-tuning techniques often aim to augment pre-trained models with new capabilities by modifying their weight matrices. However, in large-scale Transformers these weights are already rich repositories of abstract knowledge, thanks to the breadth of the pre-training data and the expansive architectural design. So their solution is: instead of seeking to add new features, take what you already have and make it better.
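To make the singular-value idea concrete, here is a minimal sketch of how one weight matrix could be adapted, using plain NumPy. The variable names and the specific way the learned vector z is applied (rescaling each singular value) are my own illustration of "tuning only the singular values," not code from the Sakana AI release.

```python
import numpy as np

# Illustrative sketch of singular value fine-tuning (SVF) on one weight matrix.
# A learned vector z rescales the singular values; U and V are left untouched.

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))        # stand-in for a pre-trained weight matrix

# Decompose once: W = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# The trainable part is just one vector the size of the singular values
# (256 numbers here), instead of a full 512 x 256 update.
z = np.ones_like(s)                        # per-task expert vector (learned in practice)

# Adapted weights for the second pass: scale each singular value by z.
W_adapted = U @ np.diag(s * z) @ Vt

# Sanity check: with z = 1 everywhere, the model is unchanged.
assert np.allclose(W_adapted, W, atol=1e-6)
```

The appeal is that a task expert is just one such z vector per weight matrix, which is also why these experts can be mixed or swapped at inference time.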
There are three properties of SVF they highlight. Now, I actually had to get a little help from AI to understand what these meant and to come up with some analogies, so I'm going to reference what I got from AI, and hopefully that helps you understand what these three things are.

First, negligible parameters: it's like choosing the best knobs to turn. Imagine tuning a massive system with hundreds of dials and buttons. Most fine-tuning methods, like LoRA, involve adding a bunch of extra knobs and switches to control things; it works, but it's bulky and complicated. SVF, on the other hand, figures out which existing knobs are the most important and only tweaks those. That is negligible parameters.

Next, high compositionality: building with Lego blocks. A different analogy: think of the weight adjustments in a model as Lego blocks. In SVF, each parameter is like a separate Lego piece that can easily be snapped together in new ways; you can combine or modify them to build something new. Other methods, like LoRA, are like glued-together Lego structures; once the pieces are stuck together, it's hard to take them apart. So reusability is a big part of SVF.

And then principled regularization: coloring inside the lines. For this one, if you give kids a crayon and tell them to color, with no guidelines or anything, they're just going to scribble on a white piece of paper, and there's really not going to be a lot of structure to it. But with principled regularization, it's as if you give them a coloring book with some guidelines; maybe they're coloring a penguin, for example. It gives that black-and-white outline of what to color, and then they do a much better job of coloring it, because there's actually some semblance of guidelines.
So let's again dive into the two passes. The first pass goes through and tries to figure out what the task is actually about, and they do that with pretty straightforward methods. There are three.

One, prompt engineering: their most basic approach involves constructing a new adaptation prompt, which they use to directly ask the LLM to categorize the input prompt. Here's an example of it: analyze the given question and classify it into one of four categories, code, math, reasoning, or others, and follow these guidelines. The input is basically categorized into one of those, then there are some additional instructions and a specification of how to format the output. Very traditional prompt engineering.

Next is a classification expert: a direct extension of the prompt-engineering approach that comes from using a specialized system to handle task identification. I believe what they mean is a separate model that actually does the classification into one of those four categories.

And then third, few-shot adaptation: leveraging additional task information by assuming extended access to its test-time conditions beyond individual prompts. So, using the same systems as before, but being able to iterate multiple times.
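That few-shot variant needs some way to decide how heavily to weight each expert for a new task, and the cross-entropy method (CEM) mentioned earlier is the kind of optimizer you would reach for. Below is a generic, hedged sketch of a CEM-style search over mixing weights; the objective callback, population size, and iteration count are placeholder assumptions of mine, not the paper's actual procedure.

```python
import numpy as np

# Generic cross-entropy-method (CEM) sketch: search for weights alpha that
# mix K expert vectors so the adapted model scores well on a few examples.
# `score_on_few_shot_examples(alpha)` is a hypothetical callback, e.g.
# accuracy or log-likelihood on the handful of test-time examples available.

def cem_search(score_on_few_shot_examples, num_experts=3,
               pop_size=32, elite_frac=0.25, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.full(num_experts, 1.0 / num_experts)   # start from a uniform mix
    std = np.full(num_experts, 0.5)
    n_elite = max(1, int(pop_size * elite_frac))

    for _ in range(iters):
        # Sample candidate mixing weights around the current distribution.
        candidates = rng.normal(mean, std, size=(pop_size, num_experts))
        scores = np.array([score_on_few_shot_examples(a) for a in candidates])

        # Keep the best candidates and refit the sampling distribution to them.
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3

    return mean   # final mixing weights over the expert vectors
```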
All right, so how does it actually perform? Let's start looking. What we're seeing in these four charts is the performance of Llama 3 8B Instruct using their new method. The dotted black-and-white line is the score of the original Llama 3 8B Instruct model; in red we see what happens at test time; and in blue we see what happens at training time using this new method, Transformer Squared. You can see that across the board it performs better, and the training-time results in particular are significantly better. They find that SVF provides considerable and consistent performance gains across nearly all tasks and base models, and this trend extends to the vision-language domain as well, which is very interesting. SVF also requires considerably fewer resources, with less than 10% of the training parameters of their LoRA implementation. So, better, but much more efficient, as I've said, and it applies to different models.
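To give a feel for why SVF is so much smaller than LoRA, here is some back-of-the-envelope arithmetic on a single square weight matrix. The matrix size and LoRA rank below are made-up example numbers, not figures from the paper; the "less than 10%" figure above is the paper's own claim.

```python
# Illustrative parameter counts for one weight matrix (example numbers only).
# LoRA adds two low-rank factors; SVF trains only a vector of singular-value scales.
m, n, r = 4096, 4096, 16          # hypothetical matrix shape and LoRA rank

lora_params = r * (m + n)         # A is (m x r), B is (r x n)
svf_params = min(m, n)            # one scale per singular value

print(lora_params)                # 131072
print(svf_params)                 # 4096
print(svf_params / lora_params)   # 0.03125, i.e. about 3% of LoRA's parameters
```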
All of their Transformer Squared adaptation strategies show improvement across Llama 3 8B Instruct, and on at least two out of the three tasks for Mistral 7B Instruct and Llama 3 70B Instruct, so this isn't model-specific. We're seeing a lot of those results right here: these are the three different base models, then we have the LoRA implementation and the SVF implementation, which is the basis for this paper. As we can see in bold, for GSM8K it performs better across the board; for MBPP-pro it performs better in two out of the three, and the same for ARC-Easy, two out of the three. So really good, but remember, the key is that it performs better while being much more efficient.

You're probably asking yourself: okay, if it has to take two passes, isn't that going to be really expensive, or take a lot of tokens? The answer is: a little bit more, but not really. Let's take a look. In this table we're seeing the time cost of second-pass inference on MATH, HumanEval, and ARC-Challenge, in seconds. Here are the first-pass times, which are a small fraction of the second pass, and then we have the second pass, which you can think of as a normal pass. So it really doesn't add a ton of time to the overall run of these benchmarks. The exception might be ARC-Challenge, but they actually explain that: the second-pass inference time is the time spent solving the problems, and the first-pass inference time is the time for self-adaptation (just an explainer of what you're looking at). And while the additional inference pass might appear to double the overall runtime, it's important to note that inference time primarily depends on the number of tokens generated. ARC-Challenge's cost ratio is large because its problems are single-choice, so the second pass generates very little and ends up costing about the same as the first pass; that's why the two times are so similar.

And how good is it at figuring out what type of task it is? That's what the confusion matrices show, and that's kind of a cool name. What we're seeing here is Llama 3 8B based on prompt engineering, the same model based on a classification expert, then prompt engineering and a classification expert for Mistral 7B, and then Llama 3 70B with prompt engineering. A score of 1.0 means it perfectly identified the task. For Llama 3 8B with prompt engineering, ARC-Challenge was the one it kind of struggled with, but it still did quite well, 77%. Using the classification expert it's really, really good: 95, 98, 97. We're seeing basically the same results across the board; the only one it really struggled with was prompt engineering with Llama 3 8B on ARC-Challenge.
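If confusion matrices are new to you, here is a tiny, generic example of what those numbers mean; the labels and values below are made up for illustration, not the paper's results. A perfect classifier puts all of its mass on the diagonal (1.0), and off-diagonal entries show which tasks get mistaken for which.

```python
import numpy as np

# Toy confusion matrix (rows = true task, columns = predicted task).
# Values are illustrative only; each row sums to 1.0.
tasks = ["code", "math", "reasoning", "others"]
cm = np.array([
    [0.95, 0.00, 0.03, 0.02],
    [0.01, 0.97, 0.01, 0.01],
    [0.05, 0.02, 0.77, 0.16],   # e.g. reasoning sometimes misread as "others"
    [0.03, 0.01, 0.06, 0.90],
])
print(np.diag(cm))   # per-task identification accuracy; 1.0 means perfect
```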
All right, so in conclusion: this is a new method that allows the traditional Transformer architecture to learn over time, it happens at test time, and it's really efficient. It's another great example of a cutting-edge research lab creating what may be the next generation of the Transformer model: Transformer Squared. The idea of allowing these model weights, which after the original training are fixed in time and don't change, to actually change is a really cool one, and it lets the model evolve over time rather than continuously having to release new models over and over again. We'll see if this new method catches on; I'm very excited about it. Definitely try it out: they posted the code, and I'll put the link down below in the description so you can try it yourself right now. If you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.