Revolutionary! Open Source & Local Video Model STOMPS on VEO 2

5k views

MattVidPro AI

I think AI video generation might have just had its stable diffusion moment. Today we're talking abo...

Video Transcript:

I think AI video generation might have just had its Stable Diffusion moment. Today we're talking about an open-source AI video generator that is not only open and free, but also state-of-the-art, and it can run on consumer-grade hardware. We just rolled triples, folks. This is Wan 2.1 by Alibaba, so another Chinese video generator here. They seem to be dominating not only the open-source AI video generation space but also the closed-source one. This one has also achieved a number one score on VBench, so it's at the top of the leaderboard, outperforming state-of-the-art open-source and commercial models. So this thing is a beast all the way around. No matter how you shape it up, it's a big deal.

Even a simple close look at their demo shows off this thing's capabilities. It's Wan 2.1. We start off with complex motion: we can see a woman riding a horse in some long clothing, a dancer on ice, someone breakdancing. And if we play the video, we can actually see that it's able to achieve all of these complex results, traditionally difficult for AI video generation, without morphing bodies around and mushing body parts together. Incredible. It actually looks like someone breakdancing and not just a compiled mushy mess of horrors beyond our comprehension.

And it gets better. We've got a panda on a skateboard, this cool bald eagle, a fencing competition (that's very difficult for video generators), men's diving. We've also got some options here showing off physics. We're actually cutting apart food and it looks real. There aren't new slices of the food spewing out of it; it just looks normal and natural. And also, whilst we're seeing this, a lot of these videos are cinematic-quality, high-fidelity stuff. More abstract stuff, like vines climbing on a window, is generated pretty flawlessly. I mean, it seems to be a pretty well-rounded model altogether, accomplishing lots of tasks while meeting everything the consumer wants. This is open and free, guys. This can be run on a lot of GPUs. Plenty of you at home could
probably run this locally for completely free. And cherry on top here: text generation and various visual effects to go along with it. So you can do cool writing intros, spring growing with vines. I unfortunately cannot read this (I assume it's Chinese), but it's doing other languages as well, so it seems to be multilingual in that. Super cool, very promising.

Minimum VRAM requirements here are just under 9 GB, so again, that is a lot of consumer GPUs. Here's a quick list at a glance of NVIDIA consumer-grade GPUs that have over 9 GB of VRAM. So we're going all the way back to the 30 series. Not half bad. And since it's open source, that means anyone can freely and openly modify it, so people can probably make it more efficient as well over time, making the GPU requirements even lower.

Here's the Hugging Face page for this model. If we scroll down, we can see there are actually four models to choose from. We've got some larger 14B-sized models, up to 720p resolution, but AI upscaling has come a long way, so you can always upscale these videos to an even higher resolution. Now, the model that's going to be using the lowest amount of VRAM is this 1.3B model, and while it can generate 720p, it's most optimal at 480p. If you've got a real big honkin' GPU, though, you can try to get one of these 14B models to work, which is what I tried to do. I had trouble, and we'll talk about why.

But before we get into that, I do want to talk about the free options you have to generate with this model. The first one is going to be this Hugging Face Space. This is the official Wan 2.1. Not going to lie, guys, this one is very popular and very congested. You're going to be waiting in the queue a while for your generations to go through. If you're watching this at a much later date, though, and you still want to try it out, this could still be up and generate pretty quickly for you. But for now, this is a slower method to generate for free.

Next up, they do have a Chinese website where you should be able to generate. I don't know if you can generate for free. I think you can, but the issue here is that you have to have a Chinese phone number, and it would also help if you can read Chinese. Here's the sign-in page. I translated this with ChatGPT. You're supposed to put your phone number up here and then get a code, but obviously I can't get a code because I have an American phone number. If you have a Chinese cell phone, you should absolutely be able to get onto this site and generate with Wan 2.1.

Our final free option here is Krea AI: limited free generations with a variety of video generation models, but they just added Wan 2.1. I've been waiting for this generation for a while, so it looks like their servers are swamped, but yeah, they definitely have Wan 2.1 on Krea.

Now, if you're interested in running this locally, ComfyUI did share native support for Wan 2.1, which makes it really easy to set up and install the model locally. You can see they share a variety of locally generated outputs here that are all very impressive and exciting. This will all be linked down below, but this is a step-by-step guide on how to actually set it up and get it working in ComfyUI. Like I said, it's super easy. Download and install ComfyUI from the desktop app. Download one of the Wan 2.1 models. Obviously, for most of you, this is going to end up being the smaller 1.3B model. Then you're going to download each one of these three models separately. They will literally just start downloading as soon as you click on them from this website, and you're going to want to place them in their respective folders. If you installed the desktop app version of ComfyUI, this should be located in your Documents folder. You can then go ahead and download an example workflow, import it into ComfyUI, and you should be ready to go.

Here's the workflow actually set up in ComfyUI. I tried multiple workflows. I spent a lot of time trying to get Wan 2.1 working on my particular system. I've got an RTX 5090, yes, the brand new GPU, so PyTorch isn't officially fully supported for this GPU, and we run into issues with this specific scenario. If I try to run the model locally and actually queue this up, you can see we end up getting a CUDA error. I don't want to overcomplicate any of this, but by and large, PyTorch does not fully support CUDA 12.8 on Windows just yet, even with the nightly build. Yes, I tried the nightly build. Not working for me just yet. However, if you are on Linux, it should work with the 5090, and obviously a lot of those older cards, 40 series and 30 series, should work without any issues.

So yeah, I tried really hard to get this working locally. I tried other workflows, other methods. No cigar here. It's just issues with PyTorch and CUDA; the GPU is too new. But like I said, the vast majority of you shouldn't run into issues like that. Some are even reporting to have this working on the RTX 2060 with only 6 GB of VRAM, which is insane. So honestly, try this on whatever GPU you have if you want to do this locally, and let me know your results down in the comments.

Anyways, let's actually do some quick testing and see how this model stacks up to Google Veo 2. All right, guys, so our first generation here is going to be in Krea AI, but Krea is just taking way too long, so I'm going to use a separate API for the rest of these generations. I would be running it locally if PyTorch could support it, but at the moment we're struggling. Anyways, the prompt here: "a video of a man sitting at a table. On the table there's a huge pile of nails (the tool). He is casually grabbing handfuls of these nails and eating them quickly." Obviously an impressive feat here, and as you can see if you play the video, he's sort of just looking at the camera and casually munching on a bunch of nails from the pile. It's very close to what we described. You can see that Krea actually had to generate an input image here to be used to kick off our video, but yeah, it's a very clear video and I am happy and impressed with this result.

Let's go ahead and now compare that to Google Veo 2, a paid, very expensive model that is only accessible on a couple of platforms right now. It's known to be the previous best video generator, by the way. And we get a pretty similar video here. We've got a guy with a pile of nails and he's picking up some
of the nails and munching on them. Although I will say, in the Veo example I've noticed that the nails are maybe a little bit more warped, a little bit more bendy, as he shoves them into his mouth. I actually prefer the generation from Wan 2.1 here, personally, and that's crazy considering Veo 2 is the previous best state-of-the-art video generator and this one is free and open source. There's a little bit of weirdness going on with his shirt here. It looks like there's some glitter. I don't know if it's just because he's been eating too many nails. But yeah, at the end of the day, the generations are close in quality, which is already crazy, but I do prefer this one.

All right, for the rest of the generations we're going to be using this Wan 720p API, because I believe it's going to be faster. We do have an input image required here. The next prompt we're testing is "anime Ghibli-style 2D animation of a giant lemon monster destroying skyscrapers." For our image input, since it's required, I'm going to be using generations from Ideogram AI. It's my go-to, and they just released a new model, so I figured why not test it out. This is Ideogram 2a, and it did a pretty good job. We've got a nice, almost Ghibli-style anime monster. We'll upload this guy and click the Run button.

And here is our generation. That actually didn't take too much time. We see the lemon monster move around. He uses a giant lemon to smash through a building, although his hand does kind of glitchily whip around his backside. But it's a pretty cool little animation nonetheless. I think this is very high quality considering, again, this is an open-source model that you could run locally on certain consumer GPUs. It's a great little demonstration. He destroys the building like we asked in the prompt. I am impressed.

Even more shocking, take a look at the Google Veo 2 generation. Yeah, we've got an anime Ghibli-style lemon monster. We've got some buildings here, but he kind of just puffs up and gets angry. He doesn't actually move or destroy any buildings. This
generation is so much better in my mind. It is a lot more animated, it's a lot more detailed, there's more going on. I am a big fan. So is Wan 2.1 two for two right now? Yes, it is.

Our next input image, with a very simple prompt: "close-up video of an orange tabby cat eating a lemon." By the way, guys, we could also mess with all kinds of settings because it is open source, so you might even be able to improve these generations beyond what you're seeing today. That is one of the many beauties of open-source things, especially ones that you can run for free on your own hardware.

And our next generation is complete. Nice close-up video of a cat munching on a lemon. It looks a little bit like stop motion or claymation because the frame rate is a little bit lower than you'll find with most video generators. That could be changed or increased with a setting, but for the sake of time we'll keep it at 16. But it's a pretty decent video overall. He's kind of biting into the lemon. It's maybe a little bit unrealistic in some ways, but you can actually see on the crest of the edge of the lemon that the hole inside it does expand as the cat munches down, which gives the effect, or implies, that he is indeed eating the lemon. And it kind of looks like how a cat normally eats, but it's definitely not as good as the Veo 2 example.

Here's the Google Veo 2 generation. Yeah, I think this one is definitely a little bit better. Here you can see the cat is just sort of licking and munching away. It looks a lot more realistic, I think, to how a cat actually eats. So I'm going to give the win here to Google Veo 2. I think this is a more realistic generation. I think it looks a little bit better. To be real, though, it's stonkingly close. I mean, Wan 2.1 comes very close to the leading, cutting-edge Veo 2, so at the end of the day I would give it closer to a tie. Considering this is open source, that's pretty incredible, not going to lie.

Next up we have "found footage CCTV gas station clip, man riding atop a giant bullfrog," and this will be our input image. Yeah, it doesn't really look like CCTV footage, I know. Neither does the Google Veo 2 generation. We'll go ahead and send this one on through.

Alrighty, man rides a bullfrog at a gas station. You can see that this one definitely was a struggle for the Wan 2.1 model. Absolutely a flop on this generation. We could always run it again and see if it does better, but yeah, this is definitely not looking too hot. We can see the gas station warp and move crazily in the background. The frog is kind of moving along nicely and the man is pretty still overall, but it's definitely not a good generation for Wan 2.1. Undeniably, the win is going to go to Google Veo 2 here. This is a pretty great video with a guy riding the bullfrog. It's pretty realistic overall. Obviously, none of these were CCTV footage, as I described earlier, but yeah, he's definitely riding the bullfrog here in Veo 2, and I think it looks a lot better. Not nearly as much morphing or detail issues.

Right, next up we have "a cinematic wedding ceremony, Shrek and Bigfoot get married." Very silly prompt, but this will at least test copyrighted characters and mythological legends. Go ahead and send that on through.

All right, we've got Bigfoot and Shrek getting married here. I actually think that the movements of Shrek as he sort of talks and moves around are pretty reminiscent of the character from the animated movies, which is cool, and Bigfoot looks pretty stable too. I like the twinkling lights here. It's a beautiful scene. Of course, I think it is a fair argument to say that in this specific case Ideogram is doing some heavy lifting here, defining Shrek and Bigfoot as separate characters, where Veo 2, as you'll see in a second, kind of struggles with that. But yeah, overall this
is a pretty impressive generation. The background is still, the characters are moving, and I even like the slight sway on Shrek's clothing. I'm happy.

Here is the Google Veo 2 generation. We can see that there is something cool, at least, going on with this generation. We see the swooping of all of the plants sort of growing across this circle thing. We actually see the characters walking and the camera moving. It does look like more of a cinematic wedding ceremony, I'll give Veo 2 that, but obviously we are having some character and subject confusion here. Now, if we could have uploaded that same Ideogram image, we'd probably get a better result with Veo 2, but we don't know for sure. So Ideogram might have been doing some heavy lifting for Wan in this circumstance, but yeah, honestly, I think Wan 2.1 still takes the cake. Silly prompt regardless.

Our next prompt is going to be "a home VHS video, Christmas, opening a present to reveal a mini nuclear reactor inside of the present, which is glowing green and radiating." So the perfect Christmas gift for any child. Let's go and send that one on through.

And here we are with our Wan 2.1 generation. I think this came out pretty stellar. It opens in a little bit of a weird way. It's not like tearing off wrapping paper like you'd expect, but there's definitely some sort of glowing green reactor-esque device on the inside, and it's spewing radiation out on Christmas morning. So I think that it did a pretty stellar job with this. You can see, though, that the box does kind of clip through the tree a little bit as she opens it. But I mean, overall, for open source, for being free, this is amazing. I'm really impressed, especially with keeping that child stable and keeping the hand stable as she opens the present. Wan 2.1 is really a game changer, man. This is pretty eye-opening.

Now here is what we got from Veo 2, and I think you'll see some issues with Veo 2 and the fact that it's trained on a lot, and I mean a lot, of YouTube video data. So this isn't the typical Christmas morning that we would like or expect to see. It's actually like a YouTube demonstration of a mini nuclear reactor product. And I think the nuclear reactor looks a lot better in Veo 2, sure, but this definitely isn't Christmas morning. Now again, there is the argument to be made that Ideogram's initial input image is doing heavy lifting here, but I couldn't even choose an input image for Veo 2 in this circumstance if I wanted to, so I'm just going to have to hand the win on over to Wan 2.1. This is not Christmas morning. Yeah, I like the mini nuclear reactor, I like that it's like a guy reviewing it, it's a cool video and it looks very real, but it did not follow the prompt as well.

Next up, this is going to be our final one: "cinematic movie footage tracking a woman with an umbrella in New York City. She looks mildly displeased as lemons are plummeting to the ground from the sky, bouncing off her umbrella." Difficult prompt. Let's see if Wan 2.1 can handle it.

And check out the generation. I think this came out pretty decent. We've got the disgruntled woman right in the center. We've got the lemons falling from the sky as it rains. It seems to be not only raining lemons but also actually raining. There are little raindrops. We can even see some cars moving around in the background, but they're kind of growing and morphing a little bit. They don't quite look right. You can see, if I actually move my head here, yeah, there's just a little bit of weirdness with the cars down there. But overall, the main goal here, the lemons falling and bouncing off, it's close. It's close. I would love to see better physics here. I would love to see the lemons really pounce in and bounce off of the umbrella in a realistic way. This does look like something more edited. The lemons definitely look a little bit more cut over the original video. Let me know what you guys think.

But here is the Veo generation. I think that this one is a little bit worse. The lemons actually look more fake in this demonstration. I think the background looks more real, maybe the woman looks a little bit more real, but the lemons aren't even touching the umbrella, and they definitely just look like a plastered-over animation of some form or function. They don't look very real. I mean, to be fair, neither does this one, but at least this one has more lemons and they're actually somewhat touching the umbrella in some of the cases. It's a difficult prompt for sure. Neither of the generators really comes close to
nailing this one, but Wan 2.1 is close enough to Veo 2, I think, that a lot of people might actually prefer this generation.

So yeah, at the end of the day, guys, Wan 2.1 is an absolute butt kicker of a model. It's open source, it runs on local hardware, and it can be used for free on certain websites. It's definitely cheaper and more efficient to run than a lot of closed-source and other open-source video generators. Just mind-blown. This absolutely could be the Stable Diffusion moment for video generation, no doubt about it. The community is already hard at work with this model, trying to get things like LoRA to work correctly with it and hopefully optimizing it even further. I'd also love to see fine-tunes for this specific model as well. I'd love to see, like, an anime fine-tune. I bet that would be really cool and actually work quite well, because this model seems to lend itself pretty well to animation in general, at least from the examples we saw today.

But yeah, get excited, guys. This is nothing but good news in my book from Wan 2.1 here. Hopefully I can get it working locally on my machine very soon. I'd love to test it out locally and include it in an upcoming video. In other news, GPT-4.5 also drops today, so expect a video on that pretty soon. I think that might have to take precedence over my Sonnet 3.