Stanford CS25: V2 I Introduction to Transformers w/ Andrej Karpathy

Stanford Online
January 10, 2023 Introduction to Transformers Andrej Karpathy: https://karpathy.ai/ Since their int...
Video Transcript:
hi everyone, welcome to CS25: Transformers United. This was a course that was taught at Stanford in winter 2023. This course is not about robots that can transform into cars, as this picture might suggest; rather, it's about deep learning models that have taken the world by storm and have revolutionized the field of AI and beyond. Starting from natural language processing, Transformers have been applied all over, from computer vision to reinforcement learning, biology, robotics, etc. We have an exciting set of videos lined up for you, with some truly fascinating speakers giving talks on how they're applying
Transformers to their research in different fields and areas. We hope you'll enjoy and learn from these videos, so without any further ado, let's get started. This is a purely introductory lecture and we'll go into the building blocks of Transformers. So first, let's start with introducing the instructors. As for me, I'm currently on a temporary leave from the PhD program, and I'm leading an AI robotics startup; we're working on some general-purpose robots. I'm very passionate about robotics and building assistive learning algorithms.
My research interests are in deep reinforcement learning and generative modeling, and I have a bunch of publications in robotics and autonomous driving, among other areas; I did my undergrad at Cornell. So, nice to meet you. I'm Steven, a first-year CS PhD student here. I did my masters at CMU, and in undergrad I was mainly into NLP research, anything involving language and text, but more recently I've been getting into computer vision as well. As for stuff I do for fun, a lot of music, mainly
piano. Some self-promo: I post a lot on my Insta, YouTube, and TikTok, so if you guys want to, check it out. My friends and I are also starting a Stanford piano club, so if anybody's interested, feel free to email or DM me for details. Other than that: martial arts, bodybuilding, huge fan of K-dramas, anime, occasional gamer. My name is Rylan. Instead of talking about myself, I'll just say I was super excited when this course was last offered; I took it the last time it was
offered and had a bunch of fun. I thought we brought in a really great group of speakers last time, I'm super excited for this offering, and I'm thankful you're all here and looking forward to a really fun quarter again, thank you. Yeah, fun fact: Rylan was the most outspoken student last year, and it's like, if someone wants to become an instructor next year, that's how you do it. Okay, cool. So what we hope you will learn in this class is, first of all, how Transformers work and how they're being applied, because, yes, definitely,
nowadays we are pretty much using them everywhere in machine learning, and there are lots of new interesting directions of research on these topics. Cool. So this class is just introductory; we're talking about the basics of Transformers, introducing them, talking about the self-attention mechanism on which they're founded, and we'll do a deep dive on models like BERT and GPT later on. Okay, so let me start with presenting the attention timeline. Attention all started with this one paper, Attention
Is All You Need, in 2017; that was the beginning of Transformers. Before that we had the prehistoric era, where we had models like RNNs and LSTMs and simple attention mechanisms that didn't work that well. Starting in 2017 we saw this explosion of Transformers into NLP, where people started using them for everything. I even heard this quote from Google: the performance increases every time you fire a linguist. After 2018, from 2018 to 2020, we saw this explosion of Transformers into other fields, like vision and a bunch of
other stuff, and biology with AlphaFold. And last year, 2021, was the start of the generative era, where we got a lot of generative modeling: models like Codex, GPT, DALL-E, Stable Diffusion, a lot of things happening in generative AI, and so we actually started taking off here. And now is the present, so this is 2022 going into 2023, and now we have models like ChatGPT and Whisper and a bunch of others, and we are scaling onwards without slowing
down. So that's great, and that's the future. Going more into this: once there were RNNs, so we had sequence models like LSTMs and GRUs. What worked here was that they're good at encoding history, but what did not work was that they didn't scale to long sequences and they were very bad at encoding context. So consider this example: trying to predict the last word in the text "I grew up in France... I speak fluent ___". Here you need to understand the context for it to predict "French", and
the attention mechanism is very good at that, whereas if you're just using LSTMs it doesn't work that well. Another thing Transformers are good at is context prediction based on content, like computing attention maps: if I have a word like "it", what noun does it connect to, and we can give a probability, an attention, over the possible candidates, and this works better than the pre-existing mechanisms. Okay, so where were we in 2021? We were on the verge of
takeoff. We were starting to realize the potential of Transformers in different fields. We solved a lot of long-sequence problems, like protein folding with AlphaFold. We started to see few-shot and zero-shot generalization, and we saw multimodal tasks and applications like generating images from language, which was DALL-E. It feels like ages ago, but this was only like two years ago. There is also a talk on Transformers from that offering that you can watch on YouTube. Cool. And this is where we were going from 2021 to 2022, which is:
we have gone from the verge of takeoff to actually taking off, and now we are seeing unique applications in audio generation, art, music, and storytelling. We are starting to see reasoning capabilities like common sense, logical reasoning, and mathematical reasoning. We are also now able to get human alignment and interaction; models are able to use reinforcement learning with human feedback, which is how ChatGPT is trained to perform really well. We have a lot of mechanisms for controlling toxicity, bias, and ethics now, and there are also a lot of developments in other areas, like diffusion
models. Cool, so the future is a spaceship and we're all excited about it. There are a lot more applications that we can enable, and it'll be great if we can see Transformers work there too. One big example is video understanding and generation; that is something that everyone is interested in, and I'm hoping we'll see a lot of models in this area this year. Also finance and business. I'd be very excited to see GPT author a novel, but we need to solve very long sequence modeling, and
most Transformer models are still limited to like four thousand tokens or something like that, so we need to make them generalize much better on long sequences. We also want generalized agents that can do a lot of multi-task, multimodal predictions, like Gato, and I think we will see more of that too. And finally, we also want domain-specific models: so you might want a GPT model that's trained only on, say, health data, so that could be like a doctor
GPT model, and you might have a lawyer GPT model that's trained only on law data. So currently we have GPT models that are trained on everything, but we might start to see more niche models that are good at one task, and we could have a mixture of experts. You can think of it like how you normally consult an expert: we'll have expert AI models and you can go to a different AI model for your different needs. There are still a lot of missing ingredients to
make this all successful. The first of all is external memory. We are already starting to see this with models like ChatGPT, where the interactions are short-lived: there's no long-term memory, and they don't have the ability to remember or store conversations for the long term, and this is something we want to fix. Second is reducing the computational complexity: the attention mechanism is quadratic over the sequence length, which is slow, and we want to make it faster. Another thing we want to do is
enhance the controllability of these models. A lot of these models can be stochastic, and we want to be able to control what sort of outputs we get from them. You might have experienced with ChatGPT that if you refresh, you get a different output each time, but you might want a mechanism that controls what sort of outputs you get. And finally, we want to align our state-of-the-art language models with how the human brain works, and we are seeing this research, but we still need more work on seeing
how they can be more aligned. Okay, thank you. Great. [Andrej Karpathy] Hi, yes, I'm excited to be here. I live very nearby, so I got the invite to come to class and I was like, okay, I'll just walk over; but then I spent like ten hours on these slides, so it wasn't as simple. So yeah, I want to talk about Transformers. I'm going to skip the first two items over there, we're not going to talk about those; we're going to talk about that one, just to simplify the lecture, since we're short on time.
Okay, so I wanted to provide a little bit of context on why this Transformers class even exists, so a little bit of historical context. I feel like Bilbo over here, telling you guys about this, and I think I see a few grins. Basically, I joined AI in roughly 2012 or so, maybe a decade ago, and back then you wouldn't even say that you joined AI, by the way; that was like a dirty word. Now it's okay to talk about, but back then it was not
even deep learning, it was machine learning; that was the term you used if you were serious. But now AI is okay to use, I think. So basically, do you even realize how lucky you are, potentially entering this area in roughly 2023? Back then, in 2011 or so, when I was working specifically on computer vision, your pipelines looked like this. You wanted to classify some images; you would go to a paper, and I think this is representative: you would have three pages in the paper
describing all kinds of a zoo, a kitchen sink, of different kinds of feature descriptors, and you would go to the poster session at a computer vision conference and everyone would have their favorite feature descriptors that they're proposing, and it's totally ridiculous. You would take notes on which ones you should incorporate into your pipeline, because you would extract all of them and then you would put an SVM on top; that's what you would do. So there's two pages: make sure you get your histograms, your SSIMs, your color histograms, textons, tiny images, and don't
forget the geometry-specific histograms. All of them basically had complicated code by themselves, so you're collecting code from everywhere and running it, and it was a total nightmare. On top of that, it also didn't work. This would be, I think, a representative prediction from that time: you would just get predictions like this once in a while, and you'd just shrug your shoulders, like that just happens once in a while; today you would be looking for a bug. And worse than that, every single field,
every single chunk of AI, had their own completely separate vocabulary that they worked with. So if you go to NLP papers, those papers would be completely different: you're reading the NLP paper and you're like, what is this part-of-speech tagging, morphological analysis, syntactic parsing, coreference resolution, what are all these acronyms, and you're confused. So the vocabulary and everything was completely different, and you couldn't read papers, I would say, across different areas. Now that changed a little bit starting in 2012, when Alex Krizhevsky and colleagues basically
demonstrated that if you scale a large neural network on a large dataset, you can get very strong performance. Up till then there was a lot of focus on algorithms, but this showed that actually neural nets scale very well, so you need to now worry about compute and data, and you can scale it up and it works pretty well. And then that recipe did copy-paste across many areas of AI, so we started to see neural networks pop up everywhere since 2012: we saw them in computer vision and NLP and speech
and translation and RL and so on. Everyone started to use the same kind of modeling toolkit, modeling framework, and now when you go to NLP and you start reading papers there, in machine translation for example (this is the sequence-to-sequence paper, which we'll come back to in a bit), you start to read those papers and you're like, okay, I can recognize these words: there's a neural network, there are some parameters, there's an optimizer, and it starts to read like things that you know of. So that decreased tremendously the barrier to
entry across the different areas. And then I think the big deal is that when the Transformer came out in 2017, it's not even just that the toolkits and the neural networks were similar; it's that literally the architectures converged to one architecture that you copy-paste across everything, seemingly. This was kind of an unassuming machine translation paper at the time, proposing the Transformer architecture, but what we found since then is that you can just basically copy-paste this architecture and use it everywhere, and what's changing is the details of the data
and the chunking of the data and how you feed it in. You know, that's a caricature, but it's kind of a correct first-order statement. And so now papers are even more similar looking, because everyone's just using the Transformer, and so this convergence was remarkable to watch and unfolded over the last decade; it's pretty crazy to me. What I found kind of interesting is that I think this is some kind of a hint that we're maybe converging to something that maybe the brain is doing, because the brain is very homogeneous and uniform
across the entire sheet of your cortex, and okay, maybe some of the details are changing, but those are kind of like the hyperparameters of a Transformer; your auditory cortex and your visual cortex and everything else look very similar. So maybe we're converging to some kind of a uniform, powerful learning algorithm here, something like that, which I think is kind of interesting. Okay, so I want to talk about where the Transformer came from, very briefly, historically. I want to start in 2003; I like this paper quite a bit. It was the first sort
of popular application of neural networks to the problem of language modeling: predicting, in this case, the next word in a sequence, which allows you to build generative models over text. In this case they were using a multi-layer perceptron, so a very simple neural net; the neural net took three words and predicted the probability distribution for the fourth word in the sequence. So this was all well and good at this point. Now over time people started to apply this to machine translation, and that brings us to the sequence-to-sequence paper from 2014 that
was pretty influential. The big problem here was: okay, we don't just want to take three words and predict the fourth, we want to predict how to go from an English sentence to a French sentence, and the key problem was that you can have an arbitrary number of words in English and an arbitrary number of words in French, so how do you get an architecture that can process this variably sized input? Here they used an LSTM, and there are basically two chunks of it, which are a bit covered up by the slide here, but
you basically have an encoder LSTM on the left, and it just consumes one word at a time and builds up a context of what it has read, and then that acts as a conditioning vector to the decoder RNN or LSTM, which basically goes chunk, chunk, chunk, predicting the next word in the sequence, translating the English to French or something like that. Now the big problem with this, which people identified very quickly and tried to resolve, is what's called the encoder bottleneck: this entire English sentence that we
are trying to condition on is packed into a single vector that goes from the encoder to the decoder, and this is just too much information to potentially maintain in a single vector, and that didn't seem correct. So people were looking around for ways to alleviate the encoder bottleneck, as it was called at the time, and that brings us to this paper, "Neural Machine Translation by Jointly Learning to Align and Translate". Reading from the abstract: in this paper, we conjecture that the use of a fixed-
length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and propose to extend this by allowing the model to automatically soft-search for parts of the source sentence that are relevant to predicting a target word, without having to form these parts as hard segments explicitly. So this was a way to look back at the words that are coming from the encoder, and it was achieved using this soft search: as you are decoding the words here, while you are decoding them, you are allowed to look back
at the words at the encoder via this soft attention mechanism proposed in this paper. And this paper, I think, is the first time that I saw basically attention: your context vector that comes from the encoder is a weighted sum of the hidden states of the words in the encoding, and the weights of this sum come from a softmax that is based on these compatibilities between the current state as you're decoding and the hidden states generated by the encoder. So this is the first place where
you really start to see, basically, the current modern equations of attention; I think this was the first time I saw it, and it's the first time, as far as I know, that the word attention was used to name this mechanism. So I actually tried to dig into the details of the history of attention. The first author here, Dzmitry: I had an email correspondence with him, and I basically sent him an email like, Dzmitry, this is really interesting, Transformers have taken over, where
did you come up with the attention mechanism that ends up being the heart of the Transformer? And to my surprise he wrote me back this massive email, which was really fascinating; this is an excerpt from that email. Basically he talks about how he was looking for a way to avoid this bottleneck between the encoder and decoder; he had some ideas about cursors that traverse the sequences, which didn't quite work out, and then: "one day I had this thought that it would be nice to enable the decoder RNN
to learn to search where to put the cursor on the source sequence. This was sort of inspired by translation exercises that learning English in my middle school involved: your gaze shifts back and forth between the source and target sequence as you translate." So I thought this was kind of interesting: he's not a native English speaker, and here that gave him an edge in thinking about machine translation, which led to attention and then led to the Transformer; that's really fascinating. "I expressed the soft search as a softmax and then weighted
averaging of the BiRNN states, and basically, to my great excitement, this worked from the very first try." So really an interesting piece of history. And as it later turned out, the name RNNSearch was kind of lame, so the better name, attention, came from Yoshua Bengio on one of the final passes as they went over the paper. So maybe "Attention Is All You Need" would have been called "RNNSearch Is All You Need", but we have Yoshua Bengio to thank for a little bit of a better name, I would say. So
apparently that's the history of this. Okay, so that brings us to 2017, which is Attention Is All You Need. This attention component, which in Dzmitry's paper was just one small segment surrounded by all this bidirectional RNN, RNN, and decoder machinery, and this paper is saying: okay, you can actually delete everything; what's making this work very well is just attention by itself. So delete everything, keep attention. And what's remarkable about this paper, actually, is that usually you see papers that are very incremental: they add one thing and
show that it's a bit better. But I feel like Attention Is All You Need was a mix of multiple things at the same time; they were combined in a very unique way and also achieved a very good local minimum in the architecture space. So to me this is really a landmark paper; it's quite remarkable, and I think it had quite a lot of work behind the scenes. So: delete all the RNNs, just keep attention. Because attention operates over sets, and I'm going to go into this in a
second, you now need to positionally encode your inputs, because attention doesn't have a notion of space, right, so I should be careful here. They adopted this residual network structure from ResNets, they interspersed attention with multi-layer perceptrons, they used layer norms, which came from a different paper, they introduced the concept of multiple heads of attention applied in parallel, and they gave us, I think, a fairly good set of hyperparameters that to this day are used: the expansion factor in the multi-layer perceptron goes up by 4x (we'll
go into a bit more detail), and this 4x has stuck around. I believe there are a number of papers that tried to play with all kinds of little details of the Transformer, and nothing sticks, because this is actually quite good. The only change, to my knowledge, that did stick was this reshuffling of the layer norms to go into the pre-norm version: here you see the layer norms are after the multi-headed attention and feed-forward, but they just put them before instead; so just a reshuffling of layer norms.
But otherwise the GPTs and everything else that you're seeing today are basically the 2017 architecture from five years ago, and even though everyone is working on it, it's proven remarkably resilient, which I think is interesting. There are innovations that I think have been adopted, also in positional encodings; it's more common to use rotary and relative positional encodings and so on. So there have been changes, but for the most part it's proven very resilient. So really quite an interesting paper. Now I wanted to go into the attention mechanism, and I think
the way I interpret it is not similar to the ways that I've seen it presented before, so let me try a different way of showing how I see it. Basically, to me, attention is kind of like the communication phase of the Transformer. The Transformer interleaves two phases: the communication phase, which is the multi-headed attention, and the computation phase, which is this multi-layer perceptron, or MLP. In the communication phase, it's really just a data-dependent message passing on directed graphs, and you can think of
it as: okay, forget machine translation and everything, let's just say we have directed graphs; at each node you are storing a vector. Let me talk now about the communication phase, how these vectors talk to each other in this directed graph; the compute phase later is just the multi-layer perceptron, which then basically acts on every node individually. But how do these nodes talk to each other in this directed graph? So I wrote some simple Python, basically,
to create one round of communication using attention as the message passing scheme. Here a node has this private data vector; you can think of it as private information to this node. It can also emit a key, a query, and a value, and simply, that's done by a linear transformation from this node. So the query is: what are the things that I'm looking for; the key is: what are the things that I have;
and the value is: what are the things that I will communicate. And so then, when you have your graph that's made up of nodes and some random edges, when you actually have these nodes communicating, what's happening is you loop over all the nodes individually in some random order, and you are at some node, and you get the query vector q, which is: I'm a node in some graph, and this is what I'm looking for, and that's achieved via this linear transformation here. Then we look at all the nodes that point to
this node, and they broadcast what the things are that they have, which are their keys. So they broadcast the keys, I have the query, and then those interact by dot product to get scores; so basically, simply by doing a dot product, you get some kind of unnormalized weighting of the interestingness of all of the information in the nodes that point to me, relative to the things I'm looking for. Then when you normalize that with a softmax, so it just sums to one, you basically end up using those
scores, which now sum to one and form a probability distribution, and you do a weighted sum of the values to get your update. So: I have a query, they have keys, dot products give the interestingness, or affinity, softmax normalizes it, and then a weighted sum of those values flows to me and updates me. This happens for each node individually, and then we update at the end. So this kind of message passing scheme is at the heart of the Transformer, and it happens in a more vectorized,
batched way that is more confusing, and it's also interspersed with layer norms and things like that to make the training behave better, but that's roughly what's happening in the attention mechanism, I think, on a high level. So in the communication phase of the Transformer, this message passing scheme happens in every head in parallel, and then in every layer in series, with different weights each time, and that's it as far as the multi-headed attention goes.
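To make that concrete, here is a minimal sketch of that message-passing view in plain Python/NumPy. This is my own illustrative restatement of what he describes, not the nanoGPT code; the Node class, the edge representation, and the dimensions are all made up for illustration.

```python
import numpy as np

class Node:
    """One node in the directed graph; it stores a private data vector."""
    def __init__(self, dim):
        self.data = np.random.randn(dim)        # private information at this node
        # linear maps that emit a key, query, and value from the private data
        self.wk = np.random.randn(dim, dim)
        self.wq = np.random.randn(dim, dim)
        self.wv = np.random.randn(dim, dim)

    def key(self):   return self.wk @ self.data   # "what do I have"
    def query(self): return self.wq @ self.data   # "what am I looking for"
    def value(self): return self.wv @ self.data   # "what will I communicate"

def communicate(nodes, edges, dim):
    """One round of attention. edges[i] lists the indices of nodes pointing INTO node i."""
    updates = []
    for i, node in enumerate(nodes):
        q = node.query()
        inputs = [nodes[j] for j in edges[i]]
        # dot product of my query with their keys -> unnormalized interestingness scores
        scores = np.array([q @ n.key() for n in inputs]) / np.sqrt(dim)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                  # softmax: the scores now sum to one
        # weighted sum of their values is my update
        updates.append(sum(w * n.value() for w, n in zip(weights, inputs)))
    for node, new in zip(nodes, updates):         # apply all updates at the end
        node.data = new

# e.g. a causal graph over 4 nodes: each node sees itself and everything before it
nodes = [Node(16) for _ in range(4)]
edges = {i: list(range(i + 1)) for i in range(4)}
communicate(nodes, edges, 16)
```

In the real Transformer this same round is vectorized over a batch, run in several heads in parallel, and repeated layer after layer with different weights, as he says next.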
So if you look at these encoder-decoder models, you can think of it in terms of the connectivity of the nodes in the graph. All the tokens that are in the encoder, the ones we want to condition on, are fully connected to each other, so when they communicate, they communicate fully when you calculate their features. But in the decoder, because we are trying to have a language model, we don't want to have communication from future tokens, because they give away the answer at this step. So the tokens in the decoder
are fully connected from all the encoder states, and then they are also connected from everything before them in the decoder, and so you end up with this triangular structure in the directed graph; that's the message passing scheme this basically implements. And then you have to be a little bit careful, because in the cross-attention here in the decoder, you consume the features from the top of the encoder. So think of it as: in the encoder, all the nodes are looking at each other, all the tokens are looking at each other,
many, many times, and they really figure out what's in there, and then the decoder, when it's looking, is looking only at the top nodes of the encoder. So that's roughly the message passing scheme. I was going to go into more of an implementation of the Transformer; I don't know if there are any questions about this. [Question: what is the difference between self-attention and multi-headed attention?] So, multi-headed attention is just this attention scheme, but applied multiple times in parallel; multiple heads just means independent applications of the same attention. So this message passing scheme
basically just happens in parallel multiple times with different weights for the query, key, and value. You can almost look at it as: in parallel, I'm seeking different kinds of information from different nodes, and I'm collecting it all in the same node; it's all done in parallel. So heads are really just copy-paste in parallel, and layers are copy-paste in series, if that makes sense. And self-attention, when we say self-attention, what it's referring to is that each node here
produces its key, query, and value from this individual node. When you have cross-attention, which you have here coming from the encoder, that just means that the queries are still produced from this node, but the keys and the values are produced as a function of nodes that are coming from the encoder. So: I have my queries, because I'm trying to decode, say, the fifth word in the sequence, and I'm looking for certain things because I'm the fifth word, and
then the keys and the values, in terms of the source of information that could answer my queries, can come from the previous nodes in the current decoding sequence, or from the top of the encoder, so all the nodes that have already seen all of the encoding tokens many, many times can now broadcast what they contain in terms of information. So I guess to summarize: cross-attention and self-attention only differ in where the keys and the values come from; either the keys and values are produced from this
node, or they are produced from some external source, like an encoder and the nodes over there, but algorithmically it's the same mathematical operation. Okay, so think of each one of these nodes as a token. I guess I don't have a very good picture of it in the Transformer, but this node here could represent, say, the third word in the output of the decoder, and in the beginning it is just the embedding of the word, and then, okay, I
have to think through this analogy a little bit more, I came up with it this morning. One example instantiation: these nodes are basically the vectors. I'll go to the implementation and then maybe I'll make the connections to the graph. So let me, with this intuition in mind at least, go to nanoGPT, which is a complete implementation of a Transformer that is very minimal. I worked on this over the last few days, and here it is reproducing
GPT-2 on OpenWebText, so it's a pretty serious implementation, and it reproduces GPT-2, I would say, provided enough compute; this was one node of 8 GPUs for 38 hours or something like that, if I remember correctly. And it's very readable: it's 300 lines, so everyone can take a look at it. So let me briefly step through it. This is a decoder-only Transformer, and what that means is that it's a language model: it tries to model the next word in the sequence, or the next character in
the sequence. So the data that we train on is always some kind of text; here is some fake Shakespeare, sorry, this is real Shakespeare, and we're going to produce fake Shakespeare. This is called the tiny Shakespeare dataset, which is one of my favorite toy datasets: you take all of Shakespeare, concatenate it, and it's a one-megabyte file, and then you can train language models on it and get infinite Shakespeare if you like, which I think is kind of cool. So we have text, and the first thing we need to do is
to convert it to a sequence of integers, because Transformers natively process sequences of integers; you can't plug raw text into a Transformer, you need to somehow encode it. The way that encoding is done, in the simplest case, is that every character gets an integer, and then instead of "hi there" we would have a sequence of integers. So you can encode every single character as an integer and get a massive sequence of integers; you just concatenate it all into one large, long, one-dimensional sequence and then you can train on it.
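As a concrete illustration of that character-level encoding, here is a minimal sketch in the spirit of the tiny Shakespeare example; the variable names and the 'input.txt' filename are mine, not necessarily nanoGPT's.

```python
# Character-level encoding: every unique character gets an integer id.
text = open('input.txt').read()                 # e.g. the tiny Shakespeare file
chars = sorted(set(text))                       # the vocabulary of characters
stoi = {ch: i for i, ch in enumerate(chars)}    # char -> integer
itos = {i: ch for ch, i in stoi.items()}        # integer -> char

encode = lambda s: [stoi[c] for c in s]         # string -> list of integers
decode = lambda ids: ''.join(itos[i] for i in ids)

data = encode(text)                             # one long 1-D sequence of integers
print(encode("hi there"))                       # prints the integer ids for each character
print(decode(encode("hi there")))               # "hi there"
```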
Now here we only have a single document; in some cases, if you have multiple independent documents, what people like to do is create special tokens, and they intersperse those documents with these special end-of-text tokens that they splice in between to create boundaries. But those boundaries actually don't have any modeling impact; it's just that the Transformer is supposed to learn via backpropagation that the end-of-document token means it should wipe the memory. Okay, so then we produce batches. These batches of data just mean that we go back
to the one-dimensional sequence and we take out chunks of this sequence. So say the block size is eight: the block size indicates the maximum length of context that your Transformer will process. If our block size is eight, that means we are going to have up to eight characters of context to predict the ninth character in the sequence. And the batch size indicates how many sequences in parallel we're going to process, and we want this to be as large as possible, so we're fully taking advantage of the GPU and the
parallelism it offers. In this example we're doing four-by-eight batches, so every row here is an independent example, sort of, and every row here is a small chunk of the sequence that we're going to train on, and we have both the inputs and the targets at every single point here. So to fully spell out what's contained in a single four-by-eight batch to the Transformer, I sort of compacted it here: when the input is 47 by itself, the target is
58, and when the input is the sequence 47, 58, the target is 1, and when it's 47, 58, 1, the target is 51, and so on. So actually this single batch of examples, which is four by eight, has a ton of individual examples that we are expecting the Transformer to learn on in parallel. And so you'll see that the batches are learned on completely independently, but the time dimension, here along the horizontal, is also trained on in parallel, so your real batch size is more like B times T; it's just
that the context grows linearly for the predictions that you make along the time direction in the model. So these are all the examples the model will learn from this single batch.
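Here is a short sketch of how those input/target batches can be drawn from the one-dimensional sequence. It mirrors what he is describing (targets are the inputs shifted by one), but it is a simplified stand-in rather than nanoGPT's exact get_batch.

```python
import torch

block_size = 8   # maximum context length
batch_size = 4   # how many independent chunks we process in parallel

def get_batch(data):
    # data: 1-D torch.LongTensor of token ids (the whole encoded text)
    ix = torch.randint(len(data) - block_size, (batch_size,))          # random chunk starts
    x = torch.stack([data[i:i + block_size] for i in ix])              # inputs,  shape (B, T)
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])      # targets, shape (B, T): inputs shifted by one
    return x, y   # every one of the B*T positions is its own training example
```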
Now, this is the GPT class, and because this is a decoder-only model, we're not going to have an encoder, because there's no English we're translating from; we're not trying to condition on some other external information, we're just trying to produce a sequence of words that are likely to follow each other. This is all PyTorch, and I'm going slightly faster because I'm assuming people have taken 231n or something along those lines. In the forward pass, we take these indices and we encode the identity of the indices via an embedding lookup table: every single integer indexes into a lookup table of vectors in this nn.Embedding and pulls out the word vector for that token. Then, because the Transformer by itself processes sets natively, we need to
also positionally encode these vectors, so that we have both the information about the token identity and its place in the sequence, from one to block size. The information about what and where is combined additively: the token embeddings and the positional embeddings are just added, exactly as here. So this x here, and then there's optional dropout, this x here basically just contains the set of words and their positions, and that feeds into the blocks of the Transformer. We're going to look into what's in a block here, but for now
this is just a series of blocks in the Transformer, and then at the end there's a layer norm, and then you're decoding the logits for the next word, or next integer, in the sequence using a linear projection of the output of this Transformer. So lm_head here, short for language model head, is just a linear function. So basically: positionally encode all the words, feed them into a sequence of blocks, and then apply a linear layer to get the probability distribution for the next character. And then if we have the targets,
which we produced in the data loader, and you'll notice that the targets are just the inputs offset by one in time, then those targets feed into a cross-entropy loss; so this is just the negative log likelihood, a typical classification loss.
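Condensed into code, the forward pass he just walked through looks roughly like this. This is my own abbreviated sketch of a nanoGPT-style model, not the actual file: the constructor arguments are illustrative, dropout is omitted, and the Block class is sketched after the next part of the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd, n_layer):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)   # "what": token identity
        self.pos_emb = nn.Embedding(block_size, n_embd)   # "where": position 0..block_size-1
        self.blocks = nn.ModuleList(Block(n_embd, block_size=block_size) for _ in range(n_layer))
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)      # the language model head

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)          # what + where, combined additively
        for block in self.blocks:                           # communicate + compute, repeated
            x = block(x)
        logits = self.lm_head(self.ln_f(x))                 # (B, T, vocab_size)
        loss = None
        if targets is not None:                             # targets = inputs shifted by one
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss
```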
So now let's drill into what's in the blocks. These blocks are applied sequentially, and there's, again as I mentioned, the communicate phase and the compute phase. In the communicate phase all the nodes get to talk to each other, and if our block size is eight, then we are going to have eight nodes in this graph: the first node is pointed to only by itself, the second node is pointed to by the first node and itself, the third node is pointed to by the first two nodes and itself, etc. So there are eight nodes here. You have a residual pathway in x, you take it out, you apply a layer norm and then the self-attention, so that these eight nodes communicate. But you have to keep in mind that the batch is four, so because
the batch is four, this is also applied in parallel across the batch: we have eight nodes communicating, but there's a batch of four of them, all individually communicating among those eight nodes; there's no criss-cross across the batch dimension, of course, there's no batch norm. And then, once they've exchanged information, they are processed using the multi-layer perceptron, and that's the compute phase. Also, here we are missing the cross-attention, because this is a decoder-only model, so all we have is this step here, the multi-headed attention, and that's
the communicate phase, and then we have the feed-forward, which is the MLP, and that's the compute phase. I'll take questions a bit later. The MLP here is fairly straightforward: the MLP is just individual processing on each node, just transforming the feature representation at that node, applying a two-layer neural net with a GELU nonlinearity, which you can just think of as a ReLU or something like that, it's just a nonlinearity. So the MLP is straightforward; I don't think there's anything too crazy there.
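A sketch of that Block and MLP structure, in the same hedged spirit as the snippets above: the pre-norm placement and the 4x expansion follow what he describes, but the exact class names and defaults are mine, and CausalSelfAttention is sketched in the next snippet.

```python
import torch.nn as nn

class MLP(nn.Module):
    """Compute phase: a two-layer net applied to every node independently."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # the 4x expansion factor
            nn.GELU(),                        # "think of it as a ReLU"
            nn.Linear(4 * n_embd, n_embd),
        )
    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Communicate (attention) then compute (MLP), each on a residual pathway."""
    def __init__(self, n_embd, n_head=4, block_size=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)       # pre-norm: layer norm before attention
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)       # pre-norm: layer norm before the MLP
        self.mlp = MLP(n_embd)
    def forward(self, x):
        x = x + self.attn(self.ln1(x))        # communicate phase
        x = x + self.mlp(self.ln2(x))         # compute phase
        return x
```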
Then this is the causal self-attention part, the communication phase. This is kind of the meat of things and the most complicated part; it's only complicated because of the batching and the implementation detail of how you mask the connectivity in the graph, so that you can't obtain any information from the future when you're predicting your token, because otherwise it gives away the information. So if I'm the fifth token, if I'm at the fifth position, then I'm getting the fourth token coming in as the input, and I'm attending to the third, second, and first, and I'm trying to figure out
what the next token is; well, in this batch, in the next element over in the time dimension, the answer is in the input, so I can't get any information from there. That's why this is all tricky, but basically in the forward pass we are calculating the queries, keys, and values based on x; these are the keys, queries, and values here. When I'm computing the attention, I have the queries matrix-multiplying the keys; this is the dot product in parallel for all the queries and all the
keys in all the heads. I mentioned, or I ought to mention, that there's also the aspect of the heads, which is also all done in parallel here; so we have the batch dimension, the time dimension, and the head dimension, and you end up with four-dimensional tensors, and it's all really confusing. So I invite you to step through it later and convince yourself that this is actually doing the right thing. But basically you have the batch dimension, the head dimension, and the time dimension, and then you have features at them, and so
this is evaluating, for all the batch elements, all the head elements, and all the time elements, the simple Python that I gave you earlier, which is query dot-product key. Then here we do a masked fill, and what this is doing is clamping the attention between the nodes that are not supposed to communicate to negative infinity, and we're doing negative infinity because we're about to softmax, so negative infinity will make the attention of those elements be zero. So here we basically end up with
the weights, the sort of affinities between these nodes, then optional dropout, and then here, attention matrix-multiplied with V is basically the gathering of the information according to the affinities we've calculated; this is just a weighted sum of the values at all those nodes, and this matrix multiply is doing that weighted sum. Then transpose, contiguous, view, because it's all complicated and batched in four-dimensional tensors, but it's really not doing anything, then optional dropout, and then a linear projection back to the residual pathway. So this is implementing the communication phase here.
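Put together, the causal self-attention he is stepping through looks roughly like the following. This is a simplified sketch rather than nanoGPT verbatim (dropout is dropped, names are mine), but the steps are the ones he lists: project to q, k, v, dot-product the queries with the keys, mask the future with negative infinity, softmax, take a weighted sum of the values, then project back to the residual pathway.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)    # emit q, k, v for every node
        self.proj = nn.Linear(n_embd, n_embd)       # back to the residual pathway
        mask = torch.tril(torch.ones(block_size, block_size))   # lower-triangular connectivity
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim) so every head attends in parallel
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))             # scores
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))  # hide the future
        att = F.softmax(att, dim=-1)                  # affinities, each row sums to one
        y = att @ v                                    # weighted sum of the values
        y = y.transpose(1, 2).contiguous().view(B, T, C)   # re-assemble the heads
        return self.proj(y)
```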
Then you can train this Transformer, and then you can generate infinite Shakespeare, and you do this simply as follows. Because our block size is eight, we start with some token; in this case you can use something like a newline as the start token. Then you communicate only with yourself, because there's a single node, and you get the probability distribution for the first word, or first character, in the sequence. You decode the character, and then
you take that character and re-encode it as an integer, and now you have the second token: okay, we're in the first position and this is whatever integer it is, add the positional encodings, it goes into the sequence, that goes into the Transformer, and again this token now communicates with the first token and its identity. So you just keep plugging it back in, and once you run out of the block size, which is eight, you start to crop, because you can never have a block size of more than eight in the way
you've trained this Transformer. So we have more and more context until eight, and then if you want to generate beyond that, you have to start cropping, because the Transformer only works for eight elements in the time dimension. And so all of these Transformers, in the naive setting, have a finite block size, or context length, and in typical models this will be 1024 tokens or 2048 tokens, something like that; but those tokens are usually BPE tokens or SentencePiece tokens or WordPiece tokens, there are many different encodings, so it's not that long.
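The sampling loop he is describing, as a short sketch on top of the GPT class sketched above; the cropping to block_size is the key detail, and sampling here is plain multinomial, with options like temperature or top-k omitted.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=8):
    # idx: (B, T) tensor of token ids, e.g. a single newline character to start
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop to the context length we trained with
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]                 # we only need the distribution for the next token
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample one token
        idx = torch.cat([idx, next_id], dim=1)    # append it and keep going
    return idx
```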
And that's why, as Div mentioned, we really want to expand the context size, and it gets gnarly because the attention is quadratic. In any case, now, if you want to implement an encoder instead of decoder attention, all you have to do is delete that masking line. If you don't mask the attention, then all the nodes communicate with each other, everything is allowed, and information flows between all the nodes. So if you want to have the encoder here, all the encoder blocks will just use attention where this
line is deleted; that's it. So you're allowing whatever this encoder might store, say ten tokens, ten nodes, and they are allowed to communicate with each other all the way up the Transformer. And then if you want to implement cross-attention, so you have a full encoder-decoder Transformer, not just a decoder-only Transformer or GPT, then we need to also add cross-attention in the middle. So here there's a self-attention piece, a cross-attention piece, and this MLP, and in the cross-attention we need
to take the features from the top of the encoder; we'd need to add one more block here, and that would be the cross-attention. I should have implemented it instead of just pointing, I think, but there would be a cross-attention line here, so we'd have three lines, because we need to add another block, and the queries will come from x, but the keys and the values will come from the top of the encoder, and there will be information flowing from the encoder, strictly, into all the nodes inside x.
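Since he only points at it, here is what such a cross-attention block could look like, following the same sketch style as above. This is my own illustrative version, not something he implemented here: queries come from the decoder stream x, keys and values come from the encoder output, and no causal mask is needed because the decoder may look at all encoder tokens.

```python
import math
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.q_proj = nn.Linear(n_embd, n_embd)        # queries from the decoder nodes
        self.kv_proj = nn.Linear(n_embd, 2 * n_embd)   # keys and values from the encoder output
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x, enc_out):
        B, T, C = x.shape                               # decoder stream
        S = enc_out.size(1)                             # number of encoder tokens
        q = self.q_proj(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k, v = self.kv_proj(enc_out).split(C, dim=2)
        k = k.view(B, S, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, S, self.n_head, C // self.n_head).transpose(1, 2)
        att = F.softmax((q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1)), dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)   # information flows encoder -> x
        return self.proj(y)
```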
And then that's it; it's a very simple sort of modification on the decoder attention. So you'll hear people talk about how you can have a decoder-only model like GPT, you can have an encoder-only model like BERT, or you can have an encoder-decoder model like, say, T5, doing things like machine translation. And BERT you can't train using this autoregressive language modeling setup, where you're just trying to predict the next element in the sequence; you're training it with slightly different objectives. You're
putting in the full sentence, and the full sentence is allowed to communicate fully, and then you're trying to classify sentiment or something like that, so you're not trying to model the next token in the sequence. So these are trained slightly differently, with masking and other denoising objectives. Okay, so that's kind of like the Transformer. I'm going to continue, or yeah, maybe more questions first. [Inaudible audience question.] So I'm not sure if I fully follow, but there are different ways to look at this analogy. One analogy is that you can interpret
this graph as really fixed; it's just that every time we do the communication we are using different weights, that's one way you can look at it. So if we have a block size of eight, in my example we would have eight nodes here; they would be connected, if you lay them out, so that you only connect from left to right. And usually the connections don't change as a function of the data or anything like that; I don't think I've seen a
single example where the connectivity changes dynamically as a function of the data. Usually the connectivity is fixed: if you have an encoder and you're training a BERT, you have however many tokens you want and they are fully connected; if you have a decoder, you have this triangular thing; and if you have an encoder-decoder, then you have, awkwardly, sort of two pools of nodes. Yeah, thank you. [Question about whether there were precursor architectures leading up to the Transformer.] Okay, yeah, it's really hard to say. That's why I think this paper is so interesting: usually you'd see a
path, and maybe they had a path internally, they just didn't publish it, and all you can see is things that didn't look like a Transformer. I mean, you had ResNets, which had lots of this; a ResNet would be kind of like this, but there's no self-attention component, though the MLP is kind of there in a ResNet. So a ResNet looks very much like this, except there's no attention; and you can use layer norms in ResNets as well, I believe, though typically they would be batch norms. So it is kind of
like a ResNet; it is kind of like they took a ResNet and put in a self-attention block in addition to the pre-existing MLP block, which is kind of like the convolutions, and the MLP, strictly speaking, would be the one-by-one convolutions. But I think the idea is similar, in that the MLP is just a typical weights-nonlinearity-weights operation. But I will say, yeah, it's kind of interesting, because a lot of the intermediate work is not there, and then they give you this
Transformer, and it turns out five years later it hasn't changed, even though everyone's trying to change it. So it's kind of interesting to me that it came as a package, which I think is really interesting historically. And I also talked to the paper authors, and they were unaware of the impact that the Transformer would have at the time. So when you read this paper, actually, it's kind of unfortunate, because this is the paper that changed everything, but when people read it, it's like question
marks, because it reads like a pretty random machine translation paper: like, oh, we're doing machine translation, oh, here's a cool architecture, okay, great, good results. It doesn't sort of know what's going to happen, and so when people read it today, I think they're kind of confused, potentially. I will have some tweets at the end, but I think I would have renamed it, with the benefit of hindsight; well, I'll get to it. [Question about autoregressive modeling versus other generative approaches.] Yeah, I think that's a good question as well. Currently, I mean, I certainly
don't love the autoregressive modeling approach. I think it's kind of weird to sample a token and then commit to it, so maybe there are some hybrids with diffusion, as an example, which I think would be really cool, or we'll find some other ways to edit the sequences later, but still in the autoregressive framework. I think diffusion is an up-and-coming modeling approach that I personally find much more appealing: when I write text, I don't go
chunk, chunk, chunk and commit; I do a draft one and then I do a better draft two, and that feels like a diffusion process. So that would be my answer. [Question about whether the Transformer is a graph neural network.] "Graph neural network" is kind of a confusing term, because, I mean, previously there was this notion of graph neural networks, and I kind of think maybe today everything is a graph neural network, because the Transformer is a graph neural network processor: the native representation that the Transformer operates over is sets that are connected by edges in a directed way, so that's the native representation. And then,
yeah, okay, I should go on, because I still have like 30 slides. [Question about why attention is scaled by the square root of the dimension.] Oh yeah, the root d: basically, if you're initializing with random weights sampled from a Gaussian, then as your dimension size grows, so do the values; the variance grows, and then your softmax will just become the one-hot vector. So it's just a way to control the variance and bring it to always be in a good range, so the softmax gives a nice distribution; it's almost like an initialization thing.
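For reference, the scaling he is explaining is the $1/\sqrt{d_k}$ factor in the attention formula from Attention Is All You Need:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

If the entries of a query $q$ and a key $k$ are independent with zero mean and unit variance, then $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ has variance $d_k$, so dividing by $\sqrt{d_k}$ keeps the logits at roughly unit variance and stops the softmax from saturating into a near one-hot vector, which is exactly the variance-control argument he gives.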
Okay. So Transformers have been applied to all the other fields, and the way this was done is, in my opinion, in kind of ridiculous ways, honestly. I was a computer vision person, and you have convnets, which kind of make sense. But what we're doing now, with ViTs as an example, is you take an image and you chop it up into little squares, and then those squares literally feed into a Transformer, and that's it, which is kind of ridiculous. So the Transformer doesn't even, in the simplest case,
really know where these patches might come from; they are usually positionally encoded, but it has to sort of rediscover a lot of the structure, I think, in some ways, and it's kind of weird to approach it that way. But this simple baseline, the simplest baseline of chopping up big images into small squares and feeding them in as the individual nodes, actually works fairly well. And then this is in the Transformer encoder, so all the patches are talking to each other throughout the entire Transformer, and the number of nodes here would be something like nine.
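A sketch of that "chop the image into little squares" step; the patch size and the shapes here are illustrative, not taken from a particular ViT implementation.

```python
def image_to_patch_tokens(img, patch=16):
    # img: (B, C, H, W) tensor; assumes H and W are divisible by `patch`
    B, C, H, W = img.shape
    x = img.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    # each row is one flattened square; a linear layer would then map it to n_embd,
    # positional encodings are added, and the result feeds a Transformer encoder
    return x   # (B, num_patches, C * patch * patch): these are the "nodes"
```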
Also in speech recognition, you just take your Mel spectrogram, chop it up into little slices, and feed them into a Transformer; there were papers like this, but also Whisper. Whisper is a copy-paste Transformer: if you saw Whisper from OpenAI, you just chop up the Mel spectrogram and feed it into a Transformer, then pretend you're dealing with text, and it works very well. Decision Transformer in RL: you take your states, actions, and rewards that you
experience in an environment and you just pretend it's a language, and you start to model the sequences of that, and then you can use that for planning later; that works pretty well. Even things like AlphaFold: we were previously talking about molecules and how you can plug them in, and at the heart of AlphaFold, computationally, is also a Transformer. One thing I also wanted to say about Transformers is that I find they're super flexible, and I really enjoy that. I'll give you an example from Tesla. You
have a convnet that takes an image and makes predictions about the image, and then the big question is how you feed in extra information, and it's not always trivial. Say I have additional information that I want the outputs to be informed by: maybe I have other sensors like radar, maybe I have some map information, or a vehicle type, or some audio, and the question is how do you feed this information into a convnet? Where do you feed it in? Do you concatenate it? How do you
add it, and at what stage? With the Transformer it's much easier, because you just take whatever you want, chop it up into pieces, and feed it in with the set of what you had before, and you let the self-attention figure out how everything should communicate, and that actually, apparently, works. So you just chop up everything and throw it into the mix, and it frees neural nets from this burden of Euclidean space, where previously you had to arrange your computation
to conform to the Euclidean space of three dimensions, of how you're laying out the compute; the compute actually kind of happens in almost 3D space, if you think about it. But in attention, everything is just sets, so it's a very flexible framework: you can just throw stuff into your conditioning set, and everything just gets self-attended over. So it's quite beautiful; I respect it. Okay. So now, what exactly makes Transformers so effective? I think a good example of this comes from the GPT-3 paper, which I encourage people
to read: "Language Models are Few-Shot Learners". I would have probably renamed this a little bit; I would have said something like "Transformers are capable of in-context learning", or meta-learning, because that's kind of what makes them really special. So basically the setting that they're working with is: I have some context, say a passage (this is just one example of many), I have a passage and I'm asking questions about it, and then, as part of the context, in the prompt, I'm giving the
questions and the answers. So I'm giving one example: question, answer; another example: question, answer; another example: question, answer, and so on. (Oh yeah, people are going to have to leave soon, okay, but this is really important to me.) So what's really interesting is basically that with more examples given in the context, the accuracy improves.
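For concreteness, a few-shot prompt of the kind the GPT-3 paper evaluates looks something like the following; this particular passage and these questions are made up by me, not taken from the paper.

```python
# The "training examples" live entirely inside the prompt; the model is asked
# to complete the final answer with no gradient updates at all.
prompt = """Passage: The Eiffel Tower is in Paris. It was completed in 1889.
Q: Where is the Eiffel Tower? A: Paris
Q: When was the Eiffel Tower completed? A: 1889
Q: What city would you visit to see the Eiffel Tower? A:"""
# Feeding `prompt` to the model and sampling the continuation should yield "Paris";
# adding more question/answer examples to the context tends to improve accuracy.
```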
So if you fine-tune, you have to give an example and the answer and you fine-tune it using gradient descent, but it looks like the Transformer internally, in its activations, is doing something that maybe looks like gradient descent, some kind of meta-learning, as it is reading the prompt. In this paper they go into distinguishing this outer loop with stochastic gradient descent and this inner loop of the in-context learning: the inner loop is the Transformer reading the sequence, and the outer loop is the training by gradient descent. So basically there's some training happening in the activations of the Transformer as it is consuming a sequence, and it maybe very much looks like gradient descent; there are some recent papers that kind of hint at this and study it. As an example, in this paper here they propose something called the "raw" operator, they argue that the raw operator is implemented by the Transformer, and then they show that you can implement things like ridge regression on top of it. So there are papers hinting that maybe there is something that looks like gradient-based learning inside the activations of the Transformer. And I think this is not impossible to think through, because what is gradient-based learning? Forward pass, backward pass, and an update. Well, that looks like a resnet, right? Because you're just adding to the weights: you start with an initial random set of weights, forward pass, backward pass, update your weights, then forward pass, backward pass, update your weights again. That looks like a resnet, and the Transformer is a resnet. So this is much more hand-wavy, but basically some people are trying to figure out why that could potentially be possible.
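To spell the hand-wavy analogy out in symbols (my notation, not taken from the papers being cited), one step of gradient descent and one Transformer block are both additive updates to a running state:

```latex
% One step of gradient-based learning: additively update the weights w.
% One Transformer block: additively update the residual stream x.
\[
  w_{t+1} = w_t - \eta \,\nabla_w \mathcal{L}(w_t)
  \qquad\text{vs.}\qquad
  x_{\ell+1} = x_\ell + f_\ell(x_\ell)
\]
```

In both cases the new state is the old state plus a computed increment, which is what makes the "learning inside the forward pass" picture at least plausible.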
And then I have a bunch of tweets that I just copy-pasted here at the end. These were kind of meant for general consumption, so they're a bit more high-level and a little hype-y, but I'm talking about why this architecture is so interesting and why it potentially became so popular. I think it simultaneously optimizes three properties that are very desirable. Number one, the Transformer is very expressive in the forward pass: it's able to implement very interesting functions, potentially functions that can even do meta-learning. Number two, it is very optimizable, thanks to things like residual connections, layer norms, and so on. And number three, it's extremely efficient. This is not always appreciated, but if you look at the computational graph, the Transformer is a shallow, wide network, which is perfect for taking advantage of the parallelism of GPUs. So I think the Transformer was designed very deliberately to run efficiently on GPUs. There's previous work like the Neural GPU that I really enjoy as well, which is really about how you design neural nets that are efficient on GPUs, thinking backwards from the constraints of the hardware, which I think is a very interesting way to think about it. So here I'm saying I probably would have called the Transformer a general-purpose, efficient, optimizable computer, instead of "Attention Is All You Need"; that's what I would have, in hindsight, called that paper. It's proposing a model that is very general purpose (the forward pass is expressive), very efficient in terms of GPU usage, and easily optimizable by gradient descent, so it trains very nicely. Then I have some other hot tweets here; you can read them later, but I think this one's maybe interesting: if previously neural nets were special-purpose computers designed for specific tasks, GPT is a general-purpose computer, reconfigurable at runtime to run natural-language programs. The programs are given as prompts, and then GPT runs the program by completing the document. I really like these analogies personally: it's just like a powerful computer, and it's optimizable by gradient descent. Okay, you can read the rest later, but I'll just leave this up. Sorry, I just found this tweet: it turns out that if you scale up the training set and use a powerful enough neural net like a Transformer, the network becomes a kind of general-purpose computer over text. I think that's a nice way to look at it: instead of performing a single fixed sequence, you can design the sequence in the prompt, and because the Transformer is both powerful and trained on a large enough, very hard dataset, it kind of becomes this general-purpose text computer. So I think that's kind of interesting. [Inaudible audience question, apparently about whether RNNs share these properties.] So I think there's a bit of that, yeah.
So I would say RNNs, in principle, yes, they can implement arbitrary programs, but I think that's kind of a useless statement to some extent. They are probably expressive, in the sense of raw power, in that they can implement these arbitrary functions, but they're not optimizable, and they're certainly not efficient, because they are serial computing devices. If you look at it as a compute graph, RNNs are a very long, thin compute graph: if you took all the individual neurons and their interconnectivity, stretched them out, and tried to visualize them, an RNN would be a very long, thin graph, and that's bad. It's bad for optimizability too; I don't exactly know why, but the rough intuition is that when you're backpropagating you don't want to take too many steps. Transformers are a shallow, wide graph, so from supervision to inputs there is a very small number of hops, and there are these long residual pathways that make gradients flow very easily, plus all these layer norms to control the scales of those activations. So there aren't too many hops, you go from supervision to input very quickly, and gradients just flow through the graph. And it can all be done in parallel: you don't need to do this encoder-decoder RNN thing where you go from the first word, then the second word, then the third word; in the Transformer, every single word is processed completely in parallel, which is a big deal. So I think all three of these are really important,
and I think number three is less talked about but extremely important, because in deep learning scale matters, so the size of the network that you can train is extremely important, and if it's efficient on the current hardware, then we can make it bigger. [Inaudible audience question, apparently about feeding in other modalities such as radar.] So yeah, you take your images and you chop them up into patches, so those are the first thousand tokens or whatever, and now radar could come in as well. I won't try to make up a representation of radar on the spot, but you could: you just need to chop it up and feed it in, and then you have to encode it somehow, because the Transformer needs to know that those tokens are coming from radar. So you have some kind of special embedding that marks these as radar tokens in the representation, and it's learnable by gradient descent. Vehicle information would also come in with a special embedding token that can be learned. It's all just a set, but you can positionally encode these sets if you want. Positional encoding means you can hard-wire, for example, the coordinates, using sines and cosines; you can hard-wire that, but it's better if you don't hard-wire the position. It's just a vector that is always hanging out at this location, whatever content is there just adds onto it, and this vector is trainable by backprop. That's how you do it, and they seem to work, though sometimes something else might work better.
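As a rough sketch of what that looks like in code: chop an image into patch tokens, then add a learned positional vector at each slot plus, hypothetically, a learned per-modality embedding marking these as camera tokens. All sizes and names here are made up for illustration:

```python
# Hypothetical ViT-style tokenization: non-overlapping patches, a learned
# positional vector per slot, and a learned "these are camera tokens" marker.
import torch
import torch.nn as nn

patch, d_model = 16, 256                    # made-up sizes
n_patches = (224 // patch) ** 2             # 196 patches for a 224x224 image

to_tokens    = nn.Linear(3 * patch * patch, d_model)             # flatten patch -> token
pos_emb      = nn.Parameter(torch.zeros(1, n_patches, d_model))  # one vector per location
modality_emb = nn.Parameter(torch.zeros(1, 1, d_model))          # shared per-modality marker

def tokenize(images):                       # images: (B, 3, 224, 224)
    B = images.shape[0]
    # cut into non-overlapping 16x16 patches and flatten each one
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, 3, 14, 14, 16, 16)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, n_patches, -1)
    tokens = to_tokens(patches)             # (B, 196, d_model)
    # the positional and modality vectors just hang out at each slot and add on;
    # both are trained by backprop along with everything else
    return tokens + pos_emb + modality_emb

print(tokenize(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 196, 256])
```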
[Audience question about the positional encodings.] I mean, the positional encodings actually have very little inductive bias or something like that; they're just vectors always hanging out at a location, and you're trying to help the network. I think the intuition is good, but if you have enough data, trying to mess with it is usually a bad thing: trying to inject knowledge when you have enough knowledge in the dataset itself is not usually productive. So it really depends on what scale you are at. If you have infinite data, then you actually want to encode less and less; that turns out to work better. If you have very little data, then you actually do want to encode some biases, and maybe if you have a much smaller dataset then convolutions are a good idea, because you actually have this bias coming from the filters. So the Transformer is extremely general, but there are ways to mess with the encodings to put in more structure. You could, for
example, encode sines and cosines and fix them, or you could actually go to the attention mechanism and say, okay, if my image is chopped up into patches, this patch can only communicate with this neighborhood, and you just do that in the attention matrix: mask out whatever you don't want to communicate. People really play with this, because full attention is inefficient, so they will intersperse, for example, layers that only communicate locally within patches with layers that communicate globally, and they will do all kinds of tricks like that. So you can slowly bring in more inductive bias if you want to, but the inductive biases are sort of factored out from the core Transformer: they are factored out into the connectivity of the nodes and into the positional encodings, and you can mess with those.
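Here is a toy sketch of the masking idea: restrict each token to a local neighborhood by filling the disallowed entries of the attention matrix with negative infinity before the softmax. The window size and sequence length are arbitrary, and a real model would intersperse such local layers with fully global ones as described above:

```python
# Hypothetical local-attention mask: token i may only attend to tokens within
# +/- window positions; all other entries are set to -inf before the softmax.
import torch

def local_mask(n_tokens, window):
    idx = torch.arange(n_tokens)
    allowed = (idx[None, :] - idx[:, None]).abs() <= window  # (n, n) boolean
    mask = torch.zeros(n_tokens, n_tokens)
    mask[~allowed] = float("-inf")        # -inf entries get zero attention weight
    return mask

scores = torch.randn(8, 8)                # toy unnormalized attention scores (q.k)
attn = torch.softmax(scores + local_mask(8, window=1), dim=-1)
print(attn)                               # nonzero weights only near the diagonal
```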
[Inaudible audience question, apparently about favorite attention variants.] So there's probably about 200 papers on this now, if not more; they're kind of hard to keep track of, honestly. My Safari browser on my computer has something like 200 open tabs, and I'm not even sure I want to pick my favorite, honestly. Even [inaudible] I think was a big surprise. The one that I actually like even more is: potentially keep the context length fixed, but allow the network to somehow use a scratch pad. The way this works is you teach the Transformer, somehow, via examples in the prompt: hey, you actually have a scratch pad; you can't remember too much, your context length is finite, but you can use a scratch pad. You do that by emitting a start-scratch-pad token, then writing whatever you want to remember, then an end-scratch-pad token, and then you continue with whatever you want. Later, when it's decoding, you have special logic so that when you detect the start-scratch-pad token you save whatever it puts in there in some external store and allow it to attend over it. So basically you can teach the Transformer, just dynamically, because it's so meta-learned, to use other gizmos and gadgets and allow it to expand its memory that way, if that makes sense. It's just like a human learning to use a notepad: you don't have to keep everything in your brain. Keeping things in your brain is kind of like the context length of the Transformer, but maybe we can just give it a notebook, and then it can query the notebook, read from it, and write to it. [Laughter]
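A rough sketch of that scratch-pad decoding loop; the special tokens, the canned token stream, and the external store are all hypothetical stand-ins for a real model and tokenizer:

```python
# Hypothetical decoding loop with an external scratch pad. The model has been
# prompted (via examples) to wrap notes in <scratch> ... </scratch>; the loop
# intercepts those spans and stores them outside the context window.
stream = iter(["The", "answer", "<scratch>", "user", "likes", "blue",
               "</scratch>", "is", "blue", "<end>"])  # canned stand-in for a model

def fake_model_step(context):
    return next(stream)                  # a real model would condition on context

scratchpad = []                          # external memory, not limited by context
output, buffer, writing = [], [], False

for _ in range(256):                     # decode at most 256 tokens
    token = fake_model_step(output)
    if token == "<scratch>":
        writing = True                   # start capturing instead of emitting
    elif token == "</scratch>":
        scratchpad.append(" ".join(buffer))  # persist the note externally
        buffer, writing = [], False
    elif token == "<end>":
        break
    elif writing:
        buffer.append(token)
    else:
        output.append(token)

print(" ".join(output))                  # "The answer is blue"
print(scratchpad)                        # ["user likes blue"], can be re-injected later
```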
[Partly inaudible audience comment: they had used it extensively, but did see a forgetting event, and felt like the block size had just moved.] [Audience questions: one is what do you think about the [inaudible] architecture, and the second, a personal question: what are you going to work on next?] So right now I'm working on things like nanoGPT. I'm basically moving slightly from computer vision, and kind of the vision-based products, to a little bit of the language domain. Originally I had minGPT, which I rewrote into nanoGPT, and I'm working on trying to reproduce GPTs. And I think something like ChatGPT, incrementally improved in a product fashion, would be extremely interesting; I think a lot of people feel it, and that's why it spread so wide. So I think there's something like a Google++ to build there that I think is really interesting. Can we give our speaker a round of applause?