There are some limitations to the simple RNN, and the gated recurrent unit helps us overcome them. GRU is like a simpler alternative to LSTM; I've already made a video on LSTM, so you can go check that out. In this video we're going to talk about the limitations of simple RNN, how GRU helps us overcome them, and we're also going to look into the in-depth working of GRU. So let's get started. Let's say you want to make a grammar checker, which checks whether a sentence is grammatically correct or not. If you read this sentence, it says: "Mr. Watson goes on a business trip every Wednesday. He always takes a flight to New York, and on every trip her wife wishes him good luck." Now clearly this is grammatically incorrect, because it should be "his" here, not "her". If you want to make a grammar checker, a simple RNN might not be the best option, because a simple RNN struggles to perform well when dealing with long sentences. Let's see why that happens. In a simple RNN, every input is passed at a different time step: at the first time step "Mr Watson" will be passed, then "goes", then "on", "a", "business", "trip", and so forth.
This activation a, which is also sometimes called the hidden state, acts like a memory context: it holds the contextual information, the valuable information the network has seen so far. For example, "Mr Watson is the subject" might be valuable information, "business trip" might be valuable information, "Wednesday", which is a day, could be valuable information. All of this information is stored in the form of a matrix, and the combination of the different values that this matrix holds creates the context. But the problem is that this activation a is updated at every time step. If you look at the equation of a, you will realize that it is updated at every time step.
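For reference, the simple RNN update usually looks something like this (I'm assuming a tanh activation here; the exact activation and bias names on screen may differ slightly):

$$a_t = \tanh\big(W_{aa}\,a_{t-1} + W_{ax}\,x_t + b_a\big)$$

Every new word x_t gets folded into the same vector a_t, which is exactly why the old context can get overwritten.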
And if you're dealing with words, this memory context is just a single vector, and it might not be very large, so it can hold only a limited amount of memory. What's going to happen is that after it has seen, say, 20 words, this context is not going to be able to remember what it saw before. So it's very likely that "Mr Watson" has already been lost by the time we get here. This activation a is like RAM: it's a short-term memory in which the previous context is slowly overwritten by the new information in the sentence. That is the limitation of simple RNN. Another way to see this is through backpropagation. If you have watched my previous video on backpropagation, you will know that dL/dW is represented by an equation that is a summation from i = 1 up to the current time step. If you expand this summation, it looks like this. This term here represents the derivative of the output o with respect to the activation at time step 1, which means that when we backpropagate from time step T to time step 1 we have to multiply a lot of gradients, and the value of each of these gradients is between 0 and 1. Whenever we multiply a lot of small numbers that are less than 1, the result becomes close to zero. This problem is called the vanishing gradient problem.
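One common way to write the gradient being described here (my notation, roughly matching the on-screen equation) is:

$$\frac{\partial L}{\partial W} \;=\; \sum_{i=1}^{T} \frac{\partial L}{\partial o_T}\,\frac{\partial o_T}{\partial a_T}\,\frac{\partial a_T}{\partial a_i}\,\frac{\partial a_i}{\partial W},
\qquad
\frac{\partial a_T}{\partial a_i} \;=\; \prod_{k=i+1}^{T} \frac{\partial a_k}{\partial a_{k-1}}$$

For small i, that product contains many factors, each typically less than 1 in magnitude, so the whole term shrinks toward zero.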
That is why, when we are dealing with long sentences, a simple RNN does not remember or hold the context of the information it saw at the beginning of the sentence, and that is why we need a better unit than simple RNN, which can be either LSTM or GRU. If you watched my previous video on LSTM, you will know that LSTM helps us solve this kind of problem because it has both long-term and short-term memory. GRU does the same thing, but instead of having two different memory states, one acting as long-term and one as short-term memory, in GRU both the long-term and short-term memory are combined into one single hidden state. Thus GRU is like a lightweight version of LSTM that takes less computational power compared to LSTM. The way GRU works is that it uses the concept of gates. It has two gates: one is called the update gate and the other is called the reset gate. The update gate helps with retaining some information for a long period of time and is thus responsible for the long-term memory in this unit, while the reset gate is responsible for forgetting some information, making more room for new information to come in. There's a reason they are called gates: a gate allows some information to pass through while restricting other information. This can be understood better when we look at the equations. The equation of the hidden state is a combination of the gates and something called the candidate value, and the equation of the candidate value is very similar to the equation of the activation in simple RNN, which you have already seen.
What I've done is concatenate the two weight matrices W_aa and W_ax into one matrix, and also concatenate a_{t-1} and x_t into one vector. So this equation of the candidate value has the same form as the equation of the hidden state in simple RNN. It acts as a candidate value because this state is updated with every new word: with every new word it adds new context to its memory. That's why we call it the candidate value, because it provides new candidate information to the hidden state.
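Written out in the notation used here (square brackets for concatenation, ⊙ for element-wise multiplication), the candidate value and the hidden state are roughly:

$$\tilde{c}_t = \tanh\big(W_c\,[h_{t-1},\,x_t] + b_c\big)$$
$$h_t = u_t \odot \tilde{c}_t \;+\; (1 - u_t) \odot h_{t-1}$$

Note that some write-ups put the gate on h_{t-1} instead of the candidate; here I follow the convention described in this video, with u_t multiplying the candidate.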
Now let's come to the update gate. The update gate is designed so that it holds (roughly) either 0 or 1 as its value. Notice that the equation of the hidden state has two parts: in one part we multiply by u_t, and in the other part we multiply by 1 − u_t. So if the update gate u_t can take only 0 or 1 as its value, then only one of these parts is going to have an effect while calculating h_t. If u_t is equal to 0, the candidate term is nullified and h_t will actually be equal to h_{t-1}; this h_t is not updated at this time step, because it just takes its value from the previous time step. And if u_t is equal to 1, the previous-state term is nullified, and the unit adds the candidate value into its memory state instead. Let's dive a bit deeper to understand this. Because we are dealing with matrices, u_t will be a matrix: let's say u_t is represented by this matrix, and c̃_t, our candidate value, is represented by this one.
The black color means a value of 0, this other color means a value of 1, and we'll take the blue color to mean new information being provided to the hidden state. When we multiply these two element-wise, we end up with a matrix like this; that is this part of the equation. If you look at the other part of the equation, let's say this orange matrix represents the activation we had at the previous time step. If u_t is this, then 1 − u_t will be the complete opposite: it has 1 where u_t had 0, and 0 where u_t had 1. If we multiply these two element-wise, we end up with a matrix like this. Now we just have to add the two results together, and we end up with this as our h_t. Here you can see that some information from the previous hidden state is held as-is, while we have also added some new information. If we assume that these four boxes held the context of "Mr Watson", then this contextual information is still held at the next time step, and this process can continue for many, many time steps, as long as the network wants. Thus the context of "Mr Watson" will be held in memory for a long period of time, giving the unit the capability of both long-term and short-term memory.
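To make that picture concrete, here is a tiny NumPy sketch of the same element-wise combination, with made-up numbers and a hand-set binary gate (a real, learned gate only gets close to 0 or 1):

```python
import numpy as np

# Toy numbers, not a trained network: a hidden state of size 8, a hand-set
# binary update gate, and a candidate vector full of "new" information.
h_prev  = np.array([0.9, 0.9, 0.9, 0.9, 0.1, 0.2, 0.3, 0.4])    # first 4 slots: old "Mr Watson" context
c_tilde = np.array([0.5, -0.3, 0.8, 0.1, 0.7, -0.6, 0.2, 0.9])  # new candidate information
u       = np.array([0, 0, 0, 0, 1, 1, 1, 1])                    # update gate: 0 = keep old, 1 = take new

h = u * c_tilde + (1 - u) * h_prev  # element-wise, as in the hidden-state equation
print(h)  # first 4 entries copied unchanged from h_prev, last 4 taken from c_tilde
```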
Now let's have a look at how the update gate performs its magic. The equation of the update gate looks like this; it is similar to the equation we have seen before, just like the equation of the activation in simple RNN, with only two differences: the activation function here is a sigmoid, and the weight matrix and bias are different. The sigmoid is what performs the magic: you know that the sigmoid graph looks like this, and what happens is that most of the values end up at the higher end of the graph or at the lower end, which means most of the values will be either close to 0 or close to 1, and thus it acts as a gate.
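In the same notation as before, the update gate is roughly:

$$u_t = \sigma\big(W_u\,[h_{t-1},\,x_t] + b_u\big)$$

where σ is the sigmoid, which squashes every entry into (0, 1) and pushes most of them toward the two ends.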
Then you might ask: how does the update gate know which values should be 0 and which should be 1? The answer lies in this weight matrix. Remember, we are going to train this weight matrix, and after training it will have adjusted its values in such a way that it knows what is useful context and what is not. It's also not the case that the value of the update gate stays constant across all time steps: notice that we also have the input x in the equation, which means the update gate values change based on the input x at different time steps. By looking at every word, it decides whether that word is worth updating into memory or whether the old context is worth retaining; it adjusts which values should be updated and which should be kept. So far we have not talked about the reset gate, so let's talk about that. It's not always the case that we only want to retain information; sometimes we also want to forget some information.
For example, suppose we add another sentence to our previous one: "Mrs. Watson, his wife, loves cooking food, and he is also a businesswoman." Clearly the subject has changed from Mr. Watson to Mrs. Watson, and here we should have "she", not "he". The reset gate helps us forget old information that has become irrelevant, making more room for new information to come in. It's not always about replacing the subject; sometimes, if our network feels that the context of "Wednesday" is no longer relevant, the reset gate can simply forget that word. The equation of the reset gate is similar to the equation of the update gate, just with different weight matrices, but the reset gate is actually used in calculating c̃_t. The previous equation I showed you for c̃_t is now changed: what we have done is apply r_t to h_{t-1}, so r_t is multiplied, element-wise, with h_{t-1}. That is the only change from the previous equation, and it gives the actual equation of c̃_t.
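So, in the same notation, the reset gate and the updated candidate equation look roughly like this:

$$r_t = \sigma\big(W_r\,[h_{t-1},\,x_t] + b_r\big)$$
$$\tilde{c}_t = \tanh\big(W_c\,[\,r_t \odot h_{t-1},\,x_t] + b_c\big)$$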
If you want to visualize this in a better way, we can expand this term: this W_c matrix is a concatenation of two matrices, and I've just expanded them, so now we have two separate terms. Let's say this term is represented by this and that term is represented by that, and let's assume again that the context of "Mr Watson" was held in this region of the matrix. As we multiply r_t element-wise with h_{t-1}, r_t is going to nullify these values, and the resulting term will be something like this. Note that this blue color is summed with the green color, giving new information, but the green color itself passes through as-is, which indicates that more emphasis is given to the input x, and we have forgotten what we had here from the previous time step. So that's it about the gated recurrent unit. In this video we saw that simple RNN has limitations: it's not good enough with long sentences because it has only short-term memory, and to overcome this we have units like GRU and LSTM.
A GRU has two gates, the update gate and the reset gate. The update gate helps retain some information for a long time, while the reset gate helps in forgetting some old information. GRU has only one hidden state, which is a combination of the gates and the candidate value. Its equation has two parts, and the update gate is used in such a way that only one of those parts has an effect when calculating h_t for the next time step: either I retain some information from the previous time step, or I add some new candidate information. The equation of the candidate value is very similar to the equation of the activation in simple RNN, and it is thus responsible for providing new context to the hidden state. The equation of u_t is also similar, but it uses a different activation function, the sigmoid, and because of the sigmoid the majority of its values lie at either the 1 end or the 0 end, thus acting as a gate. The reset gate appears inside the equation of the candidate value: it resets some old information from the previous time step, allowing more room for new information in the candidate value.
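To tie everything together, here is a minimal NumPy sketch of one GRU time step using the equations from this video; the weights here are random placeholders, whereas in a real network they are learned during training:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step: update gate, reset gate, candidate value, new hidden state."""
    Wu, bu, Wr, br, Wc, bc = params
    concat = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    u = sigmoid(Wu @ concat + bu)                     # update gate u_t
    r = sigmoid(Wr @ concat + br)                     # reset gate r_t
    concat_reset = np.concatenate([r * h_prev, x_t])  # [r_t * h_{t-1}, x_t]
    c_tilde = np.tanh(Wc @ concat_reset + bc)         # candidate value
    return u * c_tilde + (1 - u) * h_prev             # new hidden state h_t

# Example with random (untrained) weights: hidden size 4, input size 3.
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
params = (rng.normal(size=(n_h, n_h + n_x)), np.zeros(n_h),   # W_u, b_u
          rng.normal(size=(n_h, n_h + n_x)), np.zeros(n_h),   # W_r, b_r
          rng.normal(size=(n_h, n_h + n_x)), np.zeros(n_h))   # W_c, b_c

h = np.zeros(n_h)
for x_t in rng.normal(size=(5, n_x)):  # a toy sequence of 5 "word vectors"
    h = gru_step(x_t, h, params)
print(h)
```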
I hope you found this video valuable. If so, please hit the like button, share it with your friends, and subscribe to the channel if you haven't already. I know I'm uploading this video after a long time, but I do plan on uploading more and more videos. Let me know in the comment section below what you think about GRU, and as usual, I will see you in the next one.