Google Research just dropped a new paper that might be the successor to the seminal "Attention Is All You Need" paper that started this entire AI wave. It details an approach called Titans, a way to give models memory much more akin to how human memory works, with long-term memory, and it happens at inference time, with an interesting surprise mechanism. This is all quite complicated, but I'm going to try to break it down for you in really simple terms. So this is the paper, "Titans: Learning to Memorize at Test Time," and it's out of Google Research.

Now, the abstract describes the problem with Transformers, which is that the context window is limited, because there's a big penalty as you increase the context window with the Transformer architecture. But what if you didn't have that limitation? What if you could have infinite tokens in a context window and it still performed really well? That is what Titans aims to deliver. The way it's described in the paper: this "more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed length context."
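To make that quadratic penalty concrete, here's a tiny back-of-the-envelope sketch (mine, not from the paper): full attention compares every token with every other token, so the score matrix alone grows as n squared.

```python
# Rough illustration (not from the paper): full attention builds an
# n x n score matrix, so memory for the scores grows quadratically.
def attention_scores(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (4_000, 128_000, 2_000_000):
    scores = attention_scores(n)
    # assuming 2 bytes per score (fp16), for a single attention head:
    print(f"{n:>9,} tokens -> {scores:.1e} scores (~{scores * 2 / 1e9:,.1f} GB per head)")
```

At 128k tokens that is already tens of gigabytes just for one head's scores, which is why context lengths get capped in practice.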
So when you think about the context windows of major models: GPT-4o is, let's say, 128k tokens, and even the biggest context window, with Gemini, is 2 million tokens. That's huge, but as our needs grow, we need ways to grow past that context window. So what they describe in this paper is a way for Titan-based models to learn to actually have long-term memory, and to figure out what to focus on, what to give their attention to. "Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers. They further can effectively scale to larger than 2 million context window size, with higher accuracy in needle-in-a-haystack tasks compared to baselines." If this is true, this is a really important paper.

So in the introduction, they first talk about Transformers and why the attention mechanism was so important: Transformers, pure attention-based architectures, "have been firmly established as state-of-the-art models in sequence modeling." Basically, the vast majority of language models you use today, from OpenAI to Anthropic to Llama to DeepSeek, everything, are based on the Transformer architecture and the attention mechanism, and it is the state of the art due to its in-context learning and ability to learn at scale. The scale is the important part. But there is a major drawback to the Transformer architecture: it comes with quadratic time and memory complexity in terms of context length. Basically, it starts to struggle with longer context lengths: the more information you give it at inference time, in the prompt, the worse it performs. Now, this is all relative, because 2 million tokens is quite a lot; you can load entire videos into that. But again, as our need to input more and more information into these models grows, this problem becomes bigger and bigger. So in complex real-world tasks, like video understanding or long-term time series forecasting, "the context window can become extremely large, making the applicability of Transformers challenging in these downstream tasks." That is the problem they are trying to solve with Titans.

What Titans really aims to do is model its architecture much more closely on how the human brain works. The human brain has multiple types of memory: short-term memory, long-term memory, meta-memory. How all of these different memory types work in coordination, but can also work independently, is a critical feature of how the human brain and our thought processes work. That is not something current Transformer models do, and so again, that is what Titans aims to do: give the model multiple types of memory and allow those different types of memory to work together. "We argue that in an effective learning paradigm, similar to the human brain, there are distinct yet interconnected memory modules, each of which is responsible for a component crucial to the learning process."

All right, now they take a 10,000-foot view: what is memory, and why is it important? "Memory is a fundamental mental process and is an inseparable component of human learning." So learning plus memory: two different things, but interconnected. If we didn't have memory, humans and animals would be "restricted to basic reflexes and stereotyped behaviors." Think about the most basic types of organisms: they are just reacting to their environment, nothing more. Now, whether that's also the case with humans, and it's just so complicated that we can't see it and we only think we have memory, that's a discussion for another time. "Taking inspiration from the common definitions of memory and learning in the neuropsychology literature," most existing architectures consider memory as "a neural update caused by an input." You see something, you hear something, you smell something: that is the input, and that is how a memory is formed. And they define learning separately from memory: learning is "a process for acquiring effective and useful memory, given an objective."

So in this paper, they aim to answer a few questions. One: what constitutes a good structure for the memory? Two: what is a proper memory update mechanism? Three: what is a good memory retrieval process? Four: how do you design an efficient architecture that incorporates different interconnected memory modules (remember: short-term, long-term, meta-memory)? And five: is a deep memory module needed to effectively store and remember the long past? Again, that is one of the drawbacks of Transformers: they really struggle with very long-term memory. So what if you could bake that memory directly into the model? Now, here's the important part: "in this paper, we aim to answer the above five questions by designing a long-term neural memory module that can efficiently and effectively learn to memorize at test time." I really want to emphasize that: test time. Not during pre-training; this is at the time when the model is actually running, when you're giving it a prompt and it's figuring out what to respond with. That is when they want to give the new memory to the model, so it learns how to memorize, to store the data into its parameters, at test time. And if this sounds familiar, it should: I covered a paper called test-time training, which, in the most simple terms, allowed a model to learn based on its prompt; it can actually update its parameters during prompting, at inference time.
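To give a feel for what "updating parameters at inference time" could look like, here's a minimal, hypothetical PyTorch sketch (my own simplification, not the paper's actual module): a small memory network takes one gradient step on each chunk of the prompt as it streams in, using an assumed associative recall loss.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a tiny memory network that updates its own weights
# during inference, one gradient step per incoming chunk of the prompt.
memory = nn.Sequential(nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 64))
opt = torch.optim.SGD(memory.parameters(), lr=1e-2)

def absorb(key: torch.Tensor, value: torch.Tensor) -> None:
    """One test-time update: nudge memory(key) toward value."""
    loss = (memory(key) - value).pow(2).mean()  # simple associative recall loss
    opt.zero_grad()
    loss.backward()
    opt.step()  # the parameters change while the model is serving the prompt

# During inference, each chunk of the prompt is "memorized" as it streams in:
for _ in range(3):
    keys, values = torch.randn(8, 64), torch.randn(8, 64)
    absorb(keys, values)
```

The key point is that `opt.step()` runs at inference time, not during pre-training.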
Now here's the most interesting part, in my mind: "we design this memory module so an event that violates the expectations (more simply, being surprising) is more memorable." Let's think about how humans work. Think about when you're doing something really boring, or something you've done a million times, something really repetitive. Those actions are so ingrained in the way you do things that you just kind of zone out and do them without thinking about them. Think about driving: do you ever show up somewhere and then think, oh wow, I don't even remember driving here? It's because you've been driving so long that you kind of don't think about it; you don't have to think about it. And that is different from when something surprises us. Now think about this: you're driving, and all of a sudden you get cut off, or there's a huge accident in front of you, or anything, a tire falls from the sky, whatever it is. You're going to remember that, because it was a surprise and not something you would usually encounter. That kind of function is what they're trying to give the model: there's this surprise mechanism, and they actually baked it into the architecture of how memory works. Very fascinating. So when the model becomes surprised, it knows: okay, I need to memorize that. They measure the surprise of an input, and they "present a decaying mechanism that considers the proportion of memory size and the amount of data surprise, resulting in better memory management." So basically, when something surprises you, it has a high memory factor, let's say, at the very beginning, and then over time the model learns to give it less attention, or less priority, whatever you want to call it. Again, think about something surprising that happens in your life: right at the time of the surprise, you are thinking about it a lot; it is deep in your memory. But as time goes on, that memory starts to decay. You start to forget details about it, and days, weeks, months, years on, that memory becomes more abstracted and less important, because it is less surprising in the current time. "This decay mechanism is in fact the generalization of the forgetting mechanism in modern recurrent models."

So what is the Titans architecture? It is the ability to incorporate that type of learning, that type of memory, into an AI model. "An important remaining question is how to effectively and efficiently incorporate memory into a deep learning architecture. We present Titans, a family of deep models that consists of three hyper-heads": core, long-term, and persistent memory. Think of core as short-term memory: it is the most important memory in that moment, and it is responsible for the main flow of data. Then there's long-term memory, which is responsible for storing and remembering memories over the long term. And then there's persistent memory, which is, as they describe it, "a set of learnable but data-independent parameters that encodes the knowledge about a task." So it's not short-term memory, it's not long-term memory; it is just baked into the model itself.
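Here's a minimal sketch of how learnable but data-independent parameters could be realized (my own simplification; the class name and shapes are assumptions, not the paper's code): a block of learnable tokens that gets prepended to whatever sequence comes in.

```python
import torch
import torch.nn as nn

class PersistentMemory(nn.Module):
    """Sketch: learnable but data-independent tokens prepended to the input.

    Hypothetical simplification: these parameters are trained with the model
    and then stay fixed; they do not depend on the incoming sequence.
    """
    def __init__(self, num_tokens: int = 16, dim: int = 64):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim) -> (num_tokens + seq_len, dim)
        return torch.cat([self.tokens, x], dim=0)

seq = torch.randn(100, 64)
extended = PersistentMemory()(seq)  # task knowledge rides along with the data
```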
They also give three different variants of the Titans architecture, with different trade-offs for each: they incorporate memory as a context, as a layer, and as a gated branch. I'll try to explain what those are in a moment. "We observe that our Titan architecture outperforms all modern recurrent models, as well as their hybrid variants, across a comprehensive set of benchmarks," and Titans scale to larger than a 2 million context window size, which is the current state-of-the-art limit of what models are capable of, with the Gemini model.

So the next section is "learning to memorize at test time." As a reminder, test time means inference time: when you prompt a model and it gives you the response, that in-between period when it's figuring out what to tell you, that is test time. And that is the interesting part, because at test time this needs to happen really quickly. Again, we reviewed that paper a few weeks ago called test-time training, which uses a similar technique: it's basically updating the model itself at inference time, at test time. "We present a neural long-term memory module," a meta model that learns to memorize at test time. So first, long-term memory: "it encodes the abstraction of the past history into its parameters." Again, it's like long-term memory in humans: you don't remember every single major and minor detail of a memory over the long term; that is just not how the human brain works, and they're trying to mimic the way the human brain works in these new Titan models. So it's an abstraction; it is a rough picture of something that happened over the long term. "Memorization, however, has almost always been known as an undesirable phenomenon in neural networks, as it limits the model generalization." If the model memorizes everything, it is less capable of generalizing; that is why knowing what to memorize is an important feature of these models. It also causes privacy concerns: if you memorize everything, you'll run into IP issues, and you'll disclose private information that shouldn't be disclosed. And it results in poor performance at test time.

Now, my favorite part: the surprise metric. As discussed earlier, "an event that violates the expectations" (I love that phrase) is basically a surprise: you expect something, and something happens that is not within your expectations. You're surprised, and thus that is something to memorize; "it is more memorable for humans. Inspired by this, a simple definition of surprise for a model can be its gradient with respect to the input." How different is the input from what the model was expecting? "The larger the gradient is, the more different the input data is from the past data." I mean, it's pretty straightforward.
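As a rough sketch of that idea (my own simplification, with an assumed associative recall loss; here I take the gradient with respect to the memory's parameters): the surprise of a new key/value pair can be measured by the size of the gradient it induces on the memory.

```python
import torch
import torch.nn as nn

# Sketch (my simplification): measure surprise as the size of the gradient
# a new input induces on the memory. A large gradient means the new data
# disagrees with what the memory currently predicts.
memory = nn.Linear(64, 64)

def surprise(key: torch.Tensor, value: torch.Tensor) -> float:
    loss = (memory(key) - value).pow(2).mean()  # assumed associative loss
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    return sum(g.norm().item() for g in grads)  # big norm = big surprise

k = torch.randn(1, 64)
print(surprise(k, memory(k).detach()))        # matches expectation: ~0
print(surprise(k, memory(k).detach() + 5.0))  # violated expectation: large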
The surprise metric, however, can result in missing important information that comes after a big surprising moment: if you put too much attention on the surprising moment itself, you may not remember the things that occur right after it. Again, think about humans: "from the human memory perspective, an event might not consistently surprise us through a long period of time, although it is memorable. The reason is that the initial moment is surprising enough to get our attention through a long time frame, leading to memorizing the entire time frame." To improve the above surprise metric, they break it into past surprise and momentary surprise: past surprise measures the surprise amount of the very recent past, and momentary surprise measures the surprise of the incoming data. So it is the new surprise versus the surprise that just happened. But we also need a way to give the model the ability to forget; it can't just memorize everything, because that's when you run into poor quality. So it needs a forgetting mechanism: "when dealing with very large sequences (millions of tokens), it is crucial to manage which past information should be forgotten. To this end, we use an adaptive forgetting mechanism that allows the memory to forget the information that is not needed anymore." This forgetting mechanism takes a few things into consideration, but the gist is that it factors in the surprise plus how much memory it has available, and it decides what to forget.
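Putting the pieces together, here's a hedged sketch of the update rule as I read it (simplified: the gates eta, theta, and alpha are plain constants here, whereas in the paper they are data-dependent and learned): past surprise decays, momentary surprise is the fresh gradient, and a forgetting gate shrinks the old memory before the new surprise is added.

```python
import torch
import torch.nn as nn

# Hedged sketch of the update, with assumed constant gates:
#   S_t = eta * S_prev - theta * grad      (past + momentary surprise)
#   M_t = (1 - alpha) * M_prev + S_t       (forgetting gate, then memorize)
memory = nn.Linear(64, 64, bias=False)
surprise_state = [torch.zeros_like(p) for p in memory.parameters()]
eta, theta, alpha = 0.9, 0.1, 0.05  # assumed values, for illustration only

def update_memory(key: torch.Tensor, value: torch.Tensor) -> None:
    loss = (memory(key) - value).pow(2).mean()
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    with torch.no_grad():
        for i, (p, g) in enumerate(zip(memory.parameters(), grads)):
            surprise_state[i] = eta * surprise_state[i] - theta * g
            p.mul_(1 - alpha).add_(surprise_state[i])  # decay, then add surprise

update_memory(torch.randn(8, 64), torch.randn(8, 64))
```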
So in this next section, they discuss the different ways they can incorporate memory. There are three, and each comes with trade-offs, as I mentioned earlier. This stuff is very complicated, so I actually got help from AI to explain it with analogies.

First, memory as context (MAC). Think of MAC like having a personal assistant in a meeting who takes detailed notes of past discussions (long-term memory), whispers relevant information in your ear when needed (retrieval), and then helps you combine both past knowledge and the current discussion to make decisions. It's basically like having someone record a bunch of memories for you and then feed you that memory as context to make a decision.

Next, we have memory as a gate (MAG). In this implementation, you have two advisors in your head: one advisor focuses only on what's happening right now (the short-term attention), another advisor draws from years of experience (long-term memory), and then you have a gatekeeper that decides how much to listen to each advisor. So one is hyperfocused on whatever is happening in the moment, another looks only at the history and not at the current moment, and a third party decides how much of each to use in whatever decision you need to make.

And then you have memory as a layer (MAL), which is different from the other two. It essentially puts information through different layers, where each layer is a type of memory. For example, the first filter might process everything through long-term memory, the second filter looks at the immediate context through attention, and each layer refines the information before passing it to the next.

Now, as I said, there are trade-offs, so let's discuss them. MAC (memory as context) is best for tasks requiring detailed historical context; MAG is more flexible and can switch between short- and long-term focus; and MAL (memory as a layer) is the most efficient, but slightly less powerful. So, three implementations: maybe you use them all, maybe you choose one. Different trade-offs you need to think about if you're implementing something like this.
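To make the gate variant concrete, here's a minimal, hypothetical sketch of the MAG idea (my own simplification; the real architecture uses the neural memory module, sliding-window attention, and persistent memory): a learned sigmoid gate blends the short-term attention branch with a long-term memory branch.

```python
import torch
import torch.nn as nn

class MemoryAsGate(nn.Module):
    """Hypothetical sketch of the MAG idea: a learned gate blends the
    short-term (attention) advisor with the long-term memory advisor."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.long_term = nn.Linear(dim, dim)  # stand-in for the neural memory
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        short, _ = self.attn(x, x, x)      # advisor 1: the current moment
        long = self.long_term(x)           # advisor 2: accumulated history
        g = torch.sigmoid(self.gate(x))    # gatekeeper: per-feature weight in 0..1
        return g * short + (1 - g) * long

out = MemoryAsGate()(torch.randn(2, 10, 64))  # (batch, seq, dim)
```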
So how do they perform? Across the top you have the different benchmarks it was tested against: ARC-e, ARC-c, Wiki, all of these. And these are the different types of architectures: we've talked about Mamba on this channel before, we've talked about TTT (test-time training) and Transformers, and here are the Titan models. If we look across the board, pretty much the Titan models win; it is really that simple. And this is for a 340 million parameter model; then you have the 400 million and 700 million parameter models, and as you can see, the ones that are highlighted performed best across the different benchmarks, and Titans did incredibly well.

Next, the needle-in-a-haystack test. That just means: when you have really long context windows, can the model remember and retrieve things that are deep in that context window without forgetting or getting messed up? What we see on the y-axis is the accuracy of retrieval from a long context, and on the x-axis we have the sequence length. As you can see, here are GPT-4 and GPT-4o mini in gray, these two gray lines, and then Titans (MAC) across the top: as the context length increases, it stays pretty consistent in performance, but the other models drop off quickly, especially some of these other models right here. And this is for a few-shot setup; then here's a fine-tuning setup, and again, as you can see, the Titans models across the board outperform the other architectures on needle-in-a-haystack.

All right, so the conclusion: "we present a neural long-term memory that, as a meta in-context learner, learns to memorize at test time." It is a recurrent model in nature, and it adaptively memorizes tokens that are more surprising or are close to surprising tokens. I absolutely love that intuition the researchers had. "Our experimental evaluation on diverse tasks validates that Titans are more effective than Transformers and recent modern linear recurrent models," specifically for long context. And that's it. That is how they're proposing to give models better long-term memory, with this surprise mechanism. I find it so fascinating, and congratulations to the authors of this paper. I'll link the paper in the description below, and if you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.