As you know, if you want to apply machine learning you need data, and often the more the better. If you want to build a classifier like a cat detector, you need to label which images have cats in them. If you want to find fraudulent credit card transactions, you need to label which transactions had fraud. But labels are often hard to get because they're expensive and require a human to review. You might still have a lot of unlabeled data, though: maybe you have a lot of credit card transactions, but you don't know which ones are fraud
and which are not. It would be great if you could somehow leverage this data to learn something interesting that will help you solve other problems you care about. This is exactly what autoencoders are for: they're an unsupervised learning process that lets you take advantage of your unlabeled data and learn interesting things about the structure of that data, which is often useful in other contexts. For instance, you can build a better classifier by using an autoencoder as a feature extractor. Or, if you don't have any labels at all, you can use autoencoders to flag anomalies
that might get your labeling process started. You can even use them to fill in missing values.

In this video I want to talk all about autoencoders: how they're structured, how they learn, and how you can put them to use. I also want to talk about an interesting and lesser-known variant called denoising autoencoders. So let's dive in.

In their simplest form, an autoencoder is a neural network that attempts to do two things. First, it compresses its input data into a lower dimension. Then it tries to use this lower-dimensional representation of the data
to recreate the original input. The difference between the attempted recreation and the original input is called the reconstruction error. By training the network to minimize this reconstruction error on your dataset, the network learns to exploit the natural structure in your data to find an efficient lower-dimensional representation.

Let's dig deeper into each part. The left part of the network is called the encoder. Its job is to transform the original input into a lower-dimensional representation. That sounds pretty complicated, so let's take a couple of minutes to discuss what it means to project something down into
a lower-dimensional representation and build some intuition for why this is a reasonable thing to do. The idea is fairly simple. Imagine you have two inputs, city and country: maybe Tokyo, Japan; Paris, France; and so on. Even though it's conceptually possible to have Hong Kong, Spain, we don't actually see this in the real data. This is because real data often lies on a lower-dimensional subspace within the full dimensionality of the input. The point is that real data isn't spread out across all possibilities but actually makes up a small
fraction of the possible space. For instance, here's an example of points that are evenly spread throughout three-dimensional space. There's no structure here because the data is totally random, and there's no way to describe the location of all of these points using fewer than three numbers per point without losing information, because this data truly spans all three dimensions.

In practice, our data has structure, which is another way of saying that it's constrained. Remember, Hong Kong, Spain is conceptually possible, but we won't ever see it in the real data, so that part of the space is unoccupied. Here's
an example of constrained data in the same space. We can still describe each point with three numbers, but this is inefficient, since the real data is constrained to a one-dimensional spiral. The trick is to find a new coordinate system with the constraints of the spiral built into it; then we would only need a single number to describe any point, without information loss. For this spiral example we can represent it exactly: here are the equations that translate the single angular dimension theta into the original three dimensions. For any particular point on
the spiral, I can choose to describe it with a single number, theta, or with three numbers, x, y, and z. It just depends on the coordinate system I'm using.

So what does all this have to do with autoencoders? Well, the encoder approximates the function that maps the data from its full input space into a lower-dimensional coordinate system that takes advantage of the structure in our data. That section was pretty dense and mathy, so let me quickly summarize: our real data is not random but has structure, and that structure
means we don't need every part of our full input space to represent our data. It's the encoder's job to map the data from that full input space into a meaningful lower dimension.

Now let's move on to the decoder. The decoder attempts to recreate the original input using the output of the encoder. In other words, it tries to reverse the encoding process. This is interesting because it's trying to recreate a higher-dimensional thing using a lower-dimensional thing, which is a bit like trying to build a house by looking at a picture of one. We
mentioned before that your true data can likely be described using fewer dimensions than the original input space, but the point of the middle layer in an autoencoder is to make it even smaller than that. This forces information loss, which is key to this whole process working. By making it so that the decoder has imperfect information, and training the whole network to minimize the reconstruction error, we force the encoder and decoder to work together to find the most efficient way to condense the input data into a lower dimension. If we did not have information loss
between the encoder and decoder, then the network could simply learn to multiply the input by one and get a perfect reconstruction. This would obviously be a useless solution: we don't need a fancy neural network just to multiply something by one.

The only way this kind of autoencoder works is by enforcing information loss with the network bottleneck. But this means we need to tune the architecture of our network so that the inner dimension is less than the dimension needed to express our data, and how could you know that in advance? What we really want is a way
of learning these representations using whatever architecture we want, without the fear that the network is going to learn this trivial solution of multiplying by one. Luckily, there's a clever tweak that avoids the problem, and this gets us into the world of denoising autoencoders.

The idea is this: before you pass the input into the network, you add noise to it. If it's an image, maybe you add blur. Then you ask the network to learn how to erase the noise you just added and reconstruct the original input. So the reconstruction error is slightly
modified so that the input to the encoder now has a noise term added. This means that multiplying the input by one is no longer a good solution, because it would just return the distorted image and still have a large reconstruction error. This is called a denoising autoencoder because it attempts to remove the noise that we artificially added to the raw data.

Now that we understand how autoencoders are structured and how they learn, let's talk about some ways you can use them. The first is as a feature extractor. In this case, after we complete
the training process, we chop off and throw away the decoder and just use the encoder part of the network. The encoder then transforms our raw data into this new coordinate system, and if you visualize the data in this new space, you'll find that similar records are clustered together. Here's a plot of embeddings of credit card transactions learned with a denoising autoencoder. If you look at the raw data in one of these clusters, you'll notice that the records have almost identical features. This should make the classifier's job easier, as the autoencoder did a
lot of the heavy lifting, and therefore your smaller labeled dataset will likely take you further.

This can be useful even if you're not building a classifier. For instance, if you have a particular record of interest, you can find its nearest neighbors in this space, or run a clustering algorithm, to find other records that are similar. This is likely to be more effective than clustering on the raw input features, since the network has learned about the structure in your data. It's particularly useful if your data is categorical in nature, as in this example, since it's not
obvious how you would search for nearest neighbors when your inputs aren't numeric.

If you don't have any labels at all, you can still use autoencoders for anomaly detection. In this case, you keep the full autoencoder and use the reconstruction error as the anomaly score. To get a grasp of this, consider our previous example of a one-dimensional spiral in 3D space. What happens if we train our autoencoder on these spiral points and then input an anomalous random point that's far from the spiral? Since our autoencoder would have only seen spiral points, the decoder
would likely return a point that's close to the spiral, even though the input point was far from it. So for anomalous input points that are far from the spiral, we expect a large reconstruction error, since the autoencoder just can't represent them well. This is why the reconstruction error can be a proxy for an anomaly score: the nature of an anomaly is that it doesn't respect the normal structure of the data, and this is where the autoencoder will have a hard time.

Finally, you can use denoising autoencoders for missing value imputation. As an example,
let's say you have these four rows of data, where the first three are complete but the last record has a missing value. The idea is that we train the network by randomly replacing true data with missing values, treating that as the noise, and ask it to learn to erase the noise. Then, once the model is trained, we can pass in inputs that actually have missing fields and use the network to predict what the missing values are likely to be. We can then use these predictions to replace our actual missing values.

I hope this video has given you an intuition for
what autoencoders are, how they learn, and what you might use them for. If you'd like more content like this, you can subscribe to my mailing list at blog.zakjost.com.
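
If you want to play with the spiral example and the anomaly-score idea from earlier, here is a rough sketch in plain Python. A few things in it are assumptions for illustration only: the spiral equations shown in the video aren't in this transcript, so the parameterization in `decode` is made up, and `encode` is a hand-written brute-force search standing in for a trained encoder, not a neural network. The point is just to show one number (theta) describing a 3D point, and the reconstruction error acting as an anomaly score.

```python
import math

# ASSUMPTION: an illustrative helix; the video's actual equations are not
# in the transcript.
def decode(theta):
    """Map the single coordinate theta back to a 3-D point (x, y, z)."""
    return (math.cos(theta), math.sin(theta), theta / (2 * math.pi))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Candidate theta values covering three turns of the spiral.
STEP = 0.001
THETAS = [i * STEP for i in range(int(6 * math.pi / STEP) + 1)]

def encode(point):
    """Hypothetical stand-in for a trained encoder: return the theta whose
    spiral point is closest to the input (brute-force search)."""
    return min(THETAS, key=lambda t: sq_dist(decode(t), point))

def reconstruction_error(point):
    """Squared distance between the input and its reconstruction."""
    return sq_dist(decode(encode(point)), point)

on_spiral = decode(2.5)          # a point that respects the structure
off_spiral = (3.0, 3.0, 0.5)     # an anomalous point far from the spiral

print("on-spiral error: ", reconstruction_error(on_spiral))
print("off-spiral error:", reconstruction_error(off_spiral))
```

The on-spiral point should come back with an error near zero, while the off-spiral point gets mapped to the nearest spot on the spiral and ends up with a much larger error. A real autoencoder would learn both mappings from data instead of having them written by hand.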