Gemma 3 - The NEW Gemma Family Members Have Arrived!!!

10.41k views · 2755 words
Sam Witteveen
In this video, I look at the release of the new Gemma 3 models, which come in four different flavors...
Video Transcript:
Okay, so it's not even one year since the Gemma models made their debut. That was February 21st, 2024, and not long after that we also got Gemma 2; if you remember, that was at the end of June 2024. After that we basically had a 2B model and a Japanese fine-tune of that model, and then things were pretty quiet up until today. Today is the release of the Gemma 3 family of models, and I do mean a family of models: instead of the two models we got with Gemma 2, we've actually got four this time, a 1B model, a 4B model, a 12B model, and the big 27B model as well.

Not only that, Google is also releasing both the base models and the instruction fine-tuned models. One of the things that's been really disappointing, which you've probably heard me talk about with some recent models, is that labs stopped releasing base models, so if you wanted to do your own fine-tune you were always stuck. The great thing is that we don't have that problem with Gemma 3: we've got access to do our own fine-tunes and our own research, to test out different ideas for things like RL and reasoning fine-tunes with the small models, and then, if we want to scale up, to do the same with the 27B model. So let's have a look at exactly what they've released, and I'll talk a little bit about the models.
Alright, jumping into the models: the first thing we notice is that Gemma 3 has gone multimodal. The 4B, 12B, and 27B can handle not just text but also vision understanding, so you can put images into them. Unfortunately the 1B doesn't have that, but the other three models have this multimodal understanding built in, done with a modified SigLIP encoder, a little bit similar to what we've seen in things like PaliGemma. So now you can use these models for visual question answering and a whole bunch of different tasks that we'll look at shortly.
Alright, next up: these have been trained with a much longer context than Gemma 2. The 1B model has a context of 32,000 tokens, but all the other models have a context of 128,000 tokens by default. That's a huge improvement over the previous Gemma models, and over a lot of the open-weights models out there, many of which start with a much smaller context window and have to be extended with RoPE etc. by the community. In this case Google has done all of this for us: they initially trained with 32k sequences and then scaled that up to 128k with RoPE. You can check this straight from the model config, as in the sketch below.
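As a quick sanity check, here's a minimal sketch (my own, not from the video) that reads those settings out of the config; the model ID follows the Hugging Face release naming, and the exact values printed are what I'd expect rather than guaranteed:

```python
# Minimal sketch: inspect the context window and RoPE settings from the
# model config. Model ID assumed from the Hugging Face release naming.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/gemma-3-12b-it")
# Multimodal Gemma 3 checkpoints keep the language-model settings in a
# nested text_config; fall back to the top-level config just in case.
text_config = getattr(config, "text_config", config)

print(text_config.max_position_embeddings)         # expect ~131072 (128k)
print(getattr(text_config, "rope_scaling", None))  # RoPE scaling info, if set
```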
All four of the models also come with a massive improvement in multilingual data compared to Gemma 2; the amount of multilingual data used was about double what went into the Gemma 2 models. Combined with keeping the 256k tokenizer from Gemma 2, which was very good across a variety of languages, these models really open up the ability to use them for multilingual tasks, and also to take them and fine-tune them for a specific language. I would not be surprised if we see Gemma 3 used a lot for language-specific versions, whether that's Korean, European languages, etc.

Okay, looking at some of the pre-training details: each of the models has been trained on a different number of tokens. The 27B model was trained on 14 trillion tokens, the 12B on 12 trillion, the 4B on 4 trillion, and the 1B on 2 trillion. In many ways that 12-billion-parameter model trained on 12 trillion tokens is really going to give a lot of bang for buck. Obviously the 27B will be the big one that gets the best scores, but for a lot of people, serving the 12B, perhaps as a quantized version, should give a pretty rock-solid model for a variety of tasks.

They also mention that each of the models is an improvement over its predecessor, with better math and reasoning etc. Supposedly the 4B model is competitive with what the 27B achieved in Gemma 2, and the Gemma 3 27B model is comparable to the Gemini 1.5 Pro model from last year. That's a huge jump, and I think it's interesting to look at.
They've made quite a number of changes in how they actually set the model up. They've changed the attention-layer architectures, optimizing them a lot relative to the Gemma 2 models. On top of the number of tokens these were trained on, all of the models were also trained with knowledge distillation, and they talk about enhancing a lot of the data-filtering techniques to improve the data actually going into the model. The models themselves were trained on both TPU v4s and v5s, which is kind of interesting: they're not trained on the latest TPUs, and you could imagine those are being used for the next round of Gemini models as we speak.

While they don't say how many post-training examples were used for each model, they do say that the extended post-training approach uses knowledge distillation as well as a number of different types of reinforcement learning, both to give the models alignment and to help them with mathematics and reasoning. Right, so I think the best thing is to jump in, have a play with the models, see how they go, and see what we can actually get out of them.
Okay, let's jump in and look at the demo. The first thing I want to show you: we can take in an image and process it like most other vision models, but because we've got a reasonably strong multilingual model, we can even get it to do things like translate the outputs it sees. You can see we've got the sign there; we're asking what it says, and then to translate it to English and French. It breaks down what's actually in the image, gives us the text in the original language, and then translates it to English and to French. Because the model has a strong multilingual element, we can make use of the vision capability and the multilingual capability at the same time.

The other thing this model is pretty strong at is dealing with multiple images at once, or recognizing multiple images within one input. You can see here we've got these two images, one daytime and one nighttime, and the model can analyze them and tell us the difference. Sure enough, it's picked up that the difference is the time of day: shadows, sky, a whole bunch of things like that. This multiple-image ability is also really nice for things like uploading a bunch of images and asking it to create a story based on them. You can see it's generated a story, 'The Quant Conspiracy': it comes up with a name for the dog, we've got some gold, we've got a girl in there, and sure enough it's drawing on the images, the girl with the frogs on her hat, etc. Pretty much all of the images have been worked into the story, so this is kind of cute for lots of different ideas you could play around with.

You could also use this for things like zero-shot classification, where you pass in one image as a positive example of something, a second image as a negative example, and then pass in other images and ask it to classify them accordingly. Again, this shows off how easily it handles multiple images; a minimal sketch of the pattern follows below.
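Here's a hedged sketch of that multi-image classification pattern in code, using the Transformers image-text-to-text pipeline that I'll come back to later in the video; the image URLs are placeholders and the prompt wording is my own:

```python
# Sketch: zero-shot classification with reference images. Needs a recent
# Transformers release with Gemma 3 support; image URLs are placeholders.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-12b-it")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/positive.jpg"},
            {"type": "text", "text": "The first image is a POSITIVE example."},
            {"type": "image", "url": "https://example.com/negative.jpg"},
            {"type": "text", "text": "The second image is a NEGATIVE example."},
            {"type": "image", "url": "https://example.com/query.jpg"},
            {"type": "text", "text": "Classify the third image as POSITIVE or "
                                     "NEGATIVE and explain your reasoning."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=128)
# The pipeline returns the chat history; the last message is the reply.
print(out[0]["generated_text"][-1]["content"])
```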
Alright, let's look at how it deals with text. Remember, this is not a dedicated OCR model, though I do think we're going to see these models get fine-tuned for OCR tasks, especially now that we've got a very strong vision-language model here, and projects like olmOCR have released their training scripts and data for exactly this kind of thing. That said, you'll see it does a pretty nice job of OCRing the text in here and gets out most of it; I think it's almost all correct. It can do similar things with handwriting. It's not necessarily as accurate there, and it will depend a lot on the handwriting, but again, you've got something that can extract it. And because it's prompt-based, we can ask for the output in markdown, ask it to capture the various elements in sections, and even follow up with other prompts. Here I'm asking it to take the math equations and put them into LaTeX, and sure enough it generates LaTeX for some of this. It looks like it was confused a little by some of the formatting, but the model clearly understands the concept of LaTeX. A sketch of this kind of prompt is below.
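Since it's all prompt-driven, an OCR-style request is just another message. A sketch, reusing the `pipe` from the earlier snippet, with a placeholder image URL:

```python
# Sketch: prompt-based OCR with markdown structure and LaTeX math. Reuses
# `pipe` from the earlier snippet; the page URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/scanned_page.jpg"},
            {"type": "text", "text": "Transcribe this page as markdown, keep "
                                     "the section structure, and convert any "
                                     "math equations to LaTeX."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=512)
print(out[0]["generated_text"][-1]["content"])
```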
Remember that the model I'm playing with here is the 12B, the middle of the lineup; the 4B model is supposed to be on par with the old 27B, and the 27B is supposed to be roughly where Gemini 1.5 Pro was. So this is a serious set of models, and you've got to commend Google for releasing something almost equivalent to their previous proprietary models in open-weights form.
Of course we can also do visual question answering, where we pass in an image and just ask a question about it. And we can use the model as a normal text model; we don't need to use the image capabilities at all. We can set the standard things we would with any Transformers model: a system prompt, the temperature, all those sorts of things (there's a text-only sketch after this section). Sure enough, it gives us good, standard, non-reasoning-model text, and for something that's 12B, remembering that a quantized version will run pretty easily on most computers, this is a pretty strong model. My guess is it's going to be supported by LM Studio etc., so if you're looking for a good medium-sized model that can also handle images, you've got to start thinking about the Gemma 3 4B or the 12B. That lets you do a whole bunch of tasks locally that in the past you would have had to send out to a proprietary model in the cloud.

You can see here that if I ask for just a caption, it gives me a nice simple caption, but it can do detailed captions as well. And the good thing is that unlike models such as Florence, which were trained for specific caption tasks, here it's all up to your prompt: if you're looking for something specific in the image, you can put that in your prompt. Sure enough, asking for a very detailed caption, it gives us exactly what we asked for.
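For plain text use, here's a minimal sketch with a system prompt and a sampling temperature. I'm showing it with the 1B text-only checkpoint and the standard text-generation pipeline, which is my own choice for the example (the video demos the 12B through a UI):

```python
# Sketch: Gemma 3 as a plain text model, with a system prompt and sampling
# temperature. The 1B text-only checkpoint is an illustrative choice.
from transformers import pipeline

text_pipe = pipeline("text-generation", model="google/gemma-3-1b-it")

messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Explain in two sentences what RoPE scaling does."},
]

out = text_pipe(messages, max_new_tokens=100, do_sample=True, temperature=0.7)
print(out[0]["generated_text"][-1]["content"])
```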
Alright, let's have a quick look at how you could set this up in the Transformers library and use it via code. To run this you will need a new version of the Transformers library; I'm actually trying this before the release, so a few things are a little different in my code, but I believe they'll have either a 4.49 or a 4.50 version of Transformers out at release time. You've then got a couple of options. The simplest way is to use the pipeline: they've got a new 'image-text-to-text' pipeline, so you can load up the model, set it up, and then to actually use it you basically pass in the image URL, and it can do it like that, as in the minimal sketch below.
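In its simplest form, that looks something like this (same caveats as before: the image URL is a placeholder):

```python
# Sketch: simplest pipeline usage, one image URL plus a text prompt.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3-12b-it")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

print(pipe(text=messages, max_new_tokens=60)[0]["generated_text"][-1]["content"])
```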
If you want to run it without the pipeline, they've got a new Gemma3ForConditionalGeneration class. You can set that up, and with it you can also use the chat templates and so on; that's how you would use it, as sketched below.
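A hedged sketch of that non-pipeline route, with the processor's chat template doing the formatting (again, the image URL is a placeholder):

```python
# Sketch: direct use of Gemma3ForConditionalGeneration with a chat template.
# Assumes a recent Transformers release; the image URL is a placeholder.
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-12b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},
            {"type": "text", "text": "What is happening in this image?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt.
new_tokens = generated[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```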
Now, I expect that by the time this video is out, or a little after, you'll also see a version in Ollama that you should be able to use, and hopefully that will be fully compatible with their SDK so you can use images and so on just like you normally would. It should also be up on Kaggle and in Google Cloud's Vertex AI Model Garden for you to use. DeepMind is clearly getting behind the release and trying to get it onto the platforms, and the 27B model may also be up on AI Studio for you to try.

I'll do some more videos with the new Gemma 3 over the next couple of weeks; we'll look at what you can do with it when you're running it locally, some of the tricks if you want to use the cloud version, and even how you can use it in some of the agent frameworks going forward. Overall, the Gemma 3 release and the family of models (and I haven't even covered things like the new ShieldGemma model coming out as well) are very impressive and definitely worth checking out if you're into using local or on-prem models. And if you're doing any sort of research with models, the 1B and the 4B let you try out a lot of different ideas with small models that are still very strong for their size.

Alright, as always, if you've got questions, put them in the comments below. If you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.