How AI 'Understands' Images (CLIP) - Computerphile

Computerphile
With the explosion of AI image generators, AI images are everywhere, but how do they 'know' how to t...
Video Transcript:
So I thought we could talk a bit more about large language models and AI image stuff. In a previous video we looked at stable diffusion and image generation using diffusion, and at one point in the video I basically handwaved off this kind of text embedding, text prompt thing: we embed this using our GPT-style Transformer embedding and we'd stick that in as well. So the idea is that you've got this model that you've trained to try and produce new images, but you want to be able to write some text that explains what you actually want in the image. How do we give it that text? How do we say "I want a frog on stilts"? How do we pass the text "frog on stilts" into a network? What we're trying to do is represent an image in the same way that we represent language within a model. That's the idea, and we call this process CLIP, or Contrastive Language-Image Pre-training. There are lots of times in AI where what we want to do is take an image and represent it in some way that we can talk about. So when you're talking to one of the more recent ChatGPTs, or Copilot on Windows or whatever, you can put an image in and say "what's in this image?" and it will try and explain it to you, or you can ask what happens in this image when something happens, and it can kind of reason about these things, or at least that's the implication of the text it's producing. But that's for a different day. The question, I suppose, is how do we turn an image into that kind of language so that we can then talk about it? One way of doing this is that you could just train up a classifier. You could say: I've got my ImageNet classification task with a thousand classes, cats, dogs, airplanes, buses, hotels, whatever, and you just train your thousand-class classifier on all the images, and then if you want text from an image, all you do is put it into the classifier, it says "it's a cat", and you've got the word "cat". You've solved it.
Now, this isn't a particularly effective solution, because there are probably more than a thousand things on Earth, right? At least I think there are, I haven't listed them. So if you train your thousand-class, or even 10,000-class, classification problem, it's still only going to work on those things you trained it on, and even then it might get them wrong, so it doesn't scale particularly well. And every time you want to introduce a new concept, let's say you want a specific type of cat, or a new species of animal that you haven't considered before, you've got to go back to the drawing board, collect a whole new bunch of data and train again. So that doesn't really work at all. The next thing you could do is say, okay, let's try and cut out this classifier and just predict text. This is quite a common thing to do; for a long time, captioning was essentially the goal of this kind of process. What you would do is have an image and a network, or a Transformer, that's trained to just spit out a sentence that describes that image, and this is quite a popular problem in the computer vision literature. So you put in an image and it says "this is a man sitting in front of a boat" or something like this. This works better in the sense that you can get better text out: instead of just "boat" or "man" you can get "this is a man in front of a boat". But it has that same problem of scale: if the man is standing in front of a hotel and you haven't given it any examples of that, maybe it won't work quite so well. So what we haven't really managed to do is find any kind of scalable way of pairing images, and their meaning, to the text that describes them, and that's what a CLIP embedding does.
The idea is that we're going to try and find some kind of embedded numerical space where the images and the text are in the same place. This is a bit like the word vectors we've looked at before, it's quite a lot like that, except that we've trained the vectors to align between text and images. So imagine some high-dimensional numerical space where everything has a fingerprint, but if you take an image and the text that represents that image, they'd have the same fingerprint, and that way you can go, ah, they're the same thing, or at least that caption represents this image. So how would we do that? Well, the first thing you need to do is collect an absolutely massive amount of data. To give a sense of the size: the CLIP paper, in fact I've got a copy of it here, is from 2021, and they collected a data set of 400 million image-caption pairs, and we can talk about that collection process in a moment. That would be considered quite small by today's standards; today 5 billion might be a more reasonable number of images. Now, downloading 5 billion images is not very easy on your laptop, and even if you get a cluster of servers going it still might take you a week. This is a monstrously large amount of data, but there is a lot of stuff on the internet, not all of it good, I should add. Anyway, the policy was basically: go on the internet and try and find images that have captions, either in the alt text or sitting nearby on the website, or something like this, and then try and find captions that have some usable, interesting information. You don't want, say, an ISBN number for a book, that's not a very helpful caption; you want a description of what's in the image. Some of the descriptions are going to be very good, like "a man in front of a boat wearing a red jumper" and all this stuff; some of them are going to be just "a dog", and you sort of take what you can get when you're scraping the internet. Some of them are going to be known classes, some of them are going to be, shall we say, not-safe-for-work classes, some of them very problematic, and unfortunately there are now so many images that you can't really look through them all and find them.
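As a purely illustrative sketch of the kind of caption filtering involved (these heuristics are hypothetical, not the rules used for the actual dataset), a crude filter might look something like this:

```python
import re

def keep_caption(caption: str) -> bool:
    """Crude, hypothetical filter for scraped alt-text style captions."""
    text = caption.strip()
    # throw away captions that are too short or too long to be descriptive
    if not (3 <= len(text.split()) <= 60):
        return False
    # throw away captions that are mostly an identifier, e.g. an ISBN for a book
    if re.search(r"\bisbn[\s:-]*[\d-]{10,}\b", text, flags=re.IGNORECASE):
        return False
    # throw away captions that are mostly digits or punctuation rather than words
    letters = sum(ch.isalpha() for ch in text)
    if letters / max(len(text), 1) < 0.5:
        return False
    return True

print(keep_caption("a man in front of a boat wearing a red jumper"))  # True
print(keep_caption("ISBN 978-3-16-148410-0"))                         # False
```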
So is this mainly still images? It's all still images; you could do this on videos, but that's a separate piece of work, let's say. So you get a web crawler going, which is a bit like one a search engine might use, but it's specifically designed to go and find images that have captions with useful words in, which you can then download, and now you've got your 400 million images. What you want to do is train a model of some description that maps those text and image pairs together. So let's start with a classifier: could we do that? We could have a 400-million-class classification problem, which doesn't make any sense, so that's not going to work. Also, some of the captions are going to be similar but not quite the same: "a red cat", "a black cat", they aren't the same, but you know what's meant. And you can't control what you're going to find on the internet. Absolutely, but the point is those are very similar, they're sort of the same class, but they're not, so a class-based system where you try and put things in specific categories isn't really going to work. What we want to do is find a way of embedding these pairs into a sort of numerical space that represents them. So here's how we're going to do it for our 400-million-image data set, or even bigger. We've got a bunch of image and text pairs, so I'm going to represent text as just a line; it's actually a numerical encoding of that text, because of course you can't actually put a string of text into a neural network, you turn it into numbers first. So we have an image, and an image, and an image, and we draw them this way, all the way up to image 400 million or something like this. I'm going to leave some out of my drawing, quite lazy actually; I've got some more paper here, this is why you do this on a computer and not by hand. What we're going to do is create two networks. We're going to have the images go into what we call a vision Transformer, and that is a Transformer model like we've talked about before, the same sort of Transformer models you see in things like diffusion or big large networks, and it's going to put this into this embedded space. So the input is an image with, let's say, red, green and blue channels, and the output is a numerical vector that says there's this in the image, and we don't really know what that means. On the other side we're going to have our text string, and we're going to put this through a normal text Transformer, similar to the way that you would structure something like ChatGPT. So this is just going to be a text Transformer; I don't know what the code for that is, I'm just going to write T. And what we're going to try and do, across this entire data set, is train it so that these embed to the same place when they're a pair, and to different places when they're not a pair, because there are different things in the image.
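To make that concrete, here is a minimal sketch of the two-encoder setup, assuming hypothetical image_backbone and text_backbone modules that expose an output_dim attribute (those names are illustrative, not from the paper or the video); each encoder is projected into one shared embedding space:

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Sketch of a CLIP-style pair of encoders sharing one embedding space."""

    def __init__(self, image_backbone, text_backbone, embed_dim=512):
        super().__init__()
        self.image_backbone = image_backbone  # e.g. a vision Transformer
        self.text_backbone = text_backbone    # e.g. a GPT-style text Transformer
        # linear projections map each backbone's features into the shared space
        self.image_proj = nn.Linear(image_backbone.output_dim, embed_dim)
        self.text_proj = nn.Linear(text_backbone.output_dim, embed_dim)

    def forward(self, images, token_ids):
        # images: (batch, 3, H, W) RGB tensors; token_ids: (batch, seq_len) encoded text
        image_embeds = self.image_proj(self.image_backbone(images))
        text_embeds = self.text_proj(self.text_backbone(token_ids))
        return image_embeds, text_embeds
```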
So how do we train this? Well, what we do is take a batch of images, let's say five, so our batch, or mini-batch, is five. We're going to have five images with five bits of text that go with those images, and we're going to put them through our vision Transformer and through our text encoder. So we're going to get image, image, image, image, image, nearly off the page, we made it, and text, text, text, text, text. I thought briefly about drawing different images in here, and then I realized that would look really bad, because my cat looks a lot like my dog and my horse, so I'm going to pretend we've got very nice pictures in there. Now, if we embed these, we're going to get an embedding for this one and an embedding for this one, some feature vector we don't know the meaning of, and what we're going to do is calculate the distance between the embedding of this and the embedding of this, let's just call that a distance D. We're going to have a distance between these ones, another distance here, another distance here, another distance here, and a distance between this one and this one, which is going to be a different distance. My notation is all off, so let's call these D1, D2, or whatever, it doesn't really matter: it's a big matrix of distances, D3, D4, D5. Now, what we've done is set this up in a way where this is the first pair, this is the second pair, and this is the third pair, so the good stuff, the embeddings that we want to be the same, are on the diagonal. For all the other combinations we're going to say: we want to dissuade you from coming up with a similar embedding for an image and a caption that don't match. So what we do when we train it is take our entire batch, calculate all of the embeddings for all of the different images and text, then calculate the similarity between each pair of embeddings, and we're going to maximize the similarities on the diagonal and, for the sake of nice colors, minimize all the other similarities in the matrix, so matched pairs get pulled together and mismatched pairs get pushed apart. We're going to do that over and over again for 400 million images; we're going to train this ViT and train this T to encode these things so that they meet in the middle like this, or, if it's a different image, so if it's a picture of a cat but the text says "a man in front of a boat", it doesn't embed them into the same place, it embeds them into very different places.
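A minimal sketch of that training objective, assuming the two encoders already produce an image_embeds and text_embeds batch (this is the standard symmetric contrastive loss used by CLIP-style models, written with a fixed temperature for simplicity rather than the learned one):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text embeddings."""
    # L2-normalize so that the dot product is the cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # (batch, batch) similarity matrix: entry [i, j] compares image i with caption j
    logits = image_embeds @ text_embeds.T / temperature
    # matched pairs sit on the diagonal, so the "correct" column for row i is i
    targets = torch.arange(logits.shape[0], device=logits.device)
    # cross-entropy in both directions pulls pairs together and pushes non-pairs apart
    loss_images = F.cross_entropy(logits, targets)   # image -> text
    loss_texts = F.cross_entropy(logits.T, targets)  # text -> image
    return (loss_images + loss_texts) / 2

# e.g. a batch of five random 512-dimensional embeddings
loss = clip_contrastive_loss(torch.randn(5, 512), torch.randn(5, 512))
```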
Now, the metric we use for this is something called cosine similarity. You could use different metrics, but cosine similarity works well, and it's easiest to think of it as the angle between two vectors. Let's imagine, just to show a quick one, that your feature embedding is only two-dimensional, so you have two values; this is a bit like when we did Face ID, when we were talking about phones and unlocking your phone with your face. You have an embedding over here and an embedding over here, and this is the angle between them. If you have a different embedding over here, that's a bigger angle, and so these two are more similar than these two. Cosine similarity just measures this angle, but in a higher-dimensional space: it's the angle between vectors in, say, a 256-element space. This works really, really well. Note that we're not training the images or this matrix; we're using this matrix to train this encoder and this encoder here, so they're taking images and putting them into the embedded space, and taking text and putting it into the embedded space. So then the question is: okay, we've done that, let's say over 400 million images, now what do we do with it? Well, the point is that you've now got a way of representing the meaning, or the content, of a photo in the same way as you can represent captions, of that photo or of a different one. So let's just show a couple of examples of the sorts of things you could do. CLIP gets used in a lot of places, for a lot of what we call downstream tasks; the downstream tasks are how you use CLIP once it's trained. So let's imagine you've got your diffusion model, which is trying to produce a lovely image of something that you've written in your prompt. You have your Gaussian noise and you have your denoising process, which is trying to produce a nice new lovely image of a person, mouth, ears, right, there we go, I'll stop now. But we want to put some text in to say we want a picture of a person instead of a picture of a frog, or something like this. So what we do here is take "a man in front of a boat", which I'm now going to have to draw in, there we go, and we put that in and embed it using the text encoder that we've just created with our CLIP training. This turns it into a big set of numbers, let's say 0.1, -0.5, and so on, all the way along: a very large list of numbers. It doesn't literally say "a man in front of a boat"; it just embeds the meaning of that sentence, in the same way you could embed an image of a man in front of a boat. And then what you do is insert this into the network during the training process to say: that's the guidance I want to give you. When I skipped over this in the stable diffusion video, that's what it's doing: it's taking a pre-trained CLIP encoder and encoding the text to guide the generation of the image.
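As a rough sketch of what that text-encoding step looks like in practice (not code from the video), here is how you might embed a prompt with a pre-trained CLIP text encoder from the Hugging Face transformers library; the openai/clip-vit-large-patch14 checkpoint is the text encoder used by Stable Diffusion v1-style models, and the exact shapes depend on which model you pick:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# load a pre-trained CLIP text encoder and its tokenizer
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a man in front of a boat"
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")

with torch.no_grad():
    # per-token embeddings, roughly shape (1, 77, 768) for this checkpoint
    prompt_embeds = text_encoder(tokens.input_ids).last_hidden_state

# a Stable-Diffusion-style UNet attends to prompt_embeds at every denoising step
print(prompt_embeds.shape)
```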
Let's think of a different problem. One of the nice things about CLIP, or one of the things that the advocates for CLIP would say is a really positive application of it, is what we call zero-shot classification. Zero-shot means you're classifying images despite never having been explicitly trained for that classification task. So how would you do that? Well, it's a bit weird, because if you think about it, you've got something that will embed an image into our embedded space, and you've got something that will embed text into that embedded space, but you can't undo that process, it doesn't work in reverse. That means we can't take a text string, embed it, and then un-embed it into an image; you can only go this direction here and this direction here. So let's imagine I wanted to find out what was in an image, but I didn't want to bother with this classifier-training process; I want to use CLIP to do it. It's been trained on 400 million images, it's got to have seen a cat before, so I should be able to classify cats if I want to. So what do we do? It's slightly odd: what we do is first embed a bunch of strings that say the image contains a cat. So we have a string that says "a photo of a cat"; now of course it might not be a cat, so we also have "a photo of a dog", and we keep this going for all the different classes that we have, all the way down to "a photo of a bus". Those are the three things I can think of, but you actually have to physically write each sentence out and then embed it, using this, into your embedded space: you embed this, you embed this, you embed this, and these are essentially your lookups, the embedded representations of those sentences. What we're doing is trying to find which of these is closest to the embedded representation of this image. So we take the image we're trying to classify, which has got a picture of a cat in it, how do you draw a cat, a sort of rabbity, catty thing, and we embed it to get this cat embedding, our kind of test embedding, and now we have to see which of those sentence embeddings is closest to it. We measure that cosine similarity and we can say, well, okay, actually it's closest to the sentence "a photo of a cat". So what we haven't done is explicitly say this picture is of a cat; we've said that if you embed this picture using our CLIP embeddings, and you also embed the phrase "a photo of a cat", those things are quite similar, which implies that it is a picture of a cat. Now, that is not foolproof in any sense, and also, if you're thinking that all seems very inefficient, couldn't you just train a classifier? Yes: for some problems I think training a classifier is by far the easiest way of doing this, but this gives you a scalable way of doing it where you don't actually have to explicitly tell it what to do; you can hope it comes out in the wash later.
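Here is roughly what that zero-shot recipe looks like with the pre-trained CLIP models in the Hugging Face transformers library (the checkpoint name and the cat.jpg path are just placeholders for whatever model and image you use):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# the "classes" are just sentences we write out and embed
texts = ["a photo of a cat", "a photo of a dog", "a photo of a bus"]
image = Image.open("cat.jpg")  # placeholder path to the image being classified

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image and each sentence
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0]):
    print(f"{text}: {p:.3f}")  # the highest score is the predicted "class"
```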
Before, when you said you inject this into the stable diffusion model, does the diffusion model have to have seen the embeddings? How does it know? How are they talking in the same space? Yeah, well, it gets learned during the process. The idea would be that, suppose you're doing the training rather than the inference, you're not producing new images, you're training. You take an image of a cat, again I'm going to have to draw one, and it needs whiskers, I'm sorry, but that's the cat, it's not a good one though, is it? Now you have a picture of a cat, and this is your training pair, so you no longer have just an image that you're trying to reconstruct. What you do is add some amount of noise to it and train your network to say "this was the noise I added", or "this was the reconstructed image", so this is a clean image of a cat, which is now different but close enough; but you also give it this text. So what you're saying is, you're learning that given an image that looks like a noisy cat, and the text saying it's meant to be a picture of a cat, give me a clean image of a cat, and over time it learns to do this in the general case, and it learns to do it where you can put different text in here. That's the idea, but you have to do this during training, because if you didn't ever give it the text during training, there'd be no way to tie the two concepts together.
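A minimal sketch of one such text-conditioned training step, written against diffusers-style components (the unet, scheduler, and text_encoder arguments, and working on image latents rather than pixels, are assumptions about the setup, not code from the video):

```python
import torch
import torch.nn.functional as F

def text_conditioned_step(unet, text_encoder, scheduler, image_latents, caption_ids):
    """One training step: predict the noise added to an image, given its caption."""
    # embed the caption with the (frozen) CLIP text encoder
    with torch.no_grad():
        text_embeds = text_encoder(caption_ids).last_hidden_state

    # pick a random timestep and add the corresponding amount of noise
    noise = torch.randn_like(image_latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (image_latents.shape[0],), device=image_latents.device)
    noisy_latents = scheduler.add_noise(image_latents, noise, timesteps)

    # the UNet sees the noisy image *and* the text embedding, and predicts the noise
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeds).sample

    # training pulls the prediction towards the noise that was actually added
    return F.mse_loss(noise_pred, noise)
```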
This network learns this over time, but a lot of this requires massive, massive scale. If you want to get really good images with really nice, nuanced text prompts, you have to have a lot of examples, because if you just have a few dozen cats it's going to be very poor, and it's not going to properly reflect what you want: you could say "a picture of a cat", but you couldn't say "a picture of a cat wearing this, in this place", and it's not very powerful. So things like Stable Diffusion and DALL·E have been trained over massive image and text sets specifically so that they have this generalizable property, or at least they are somewhat more generalizable than they would otherwise be. Eventually the network is going to start to learn how to, well, actually that's not right, because Dave's far away from Dave. So hopefully we come together: if we just sort of feel our way towards it by taking off little bits of noise at a time, we can actually produce an image. So you start off with a noisy image...