Flux.1 IMG2IMG Using LLMs for Prompt Enhancement in ComfyUI!
Nerdy Rodent
While we wait for the Flux.1 Dev controlnet models and ip adapters in ComfyUI, how about some compos...
Video Transcript:
The Flux models for generating images have been out for days now, yet somehow we haven't seen IP-Adapter or ControlNet for them. While it's tempting to wish for everything everywhere all at once with zero effort, there are still some exciting tricks we can use in ComfyUI to enhance our image generations. Techniques like image-to-image and integrating large language models can take our creations to new, weird and wonderful places. But how does all that work? Step into my rather nerdy burrow, dear friend, and I'll share a few of those secrets with you right now.

As you'll remember from my last video, you can get various Flux workflows to start with from the ComfyUI examples, so start off with one of those. To use these workflows you'll need to have ComfyUI installed, so if you haven't done that already and you do need more help, do check out the links in the video description. It's really easy and shouldn't take you longer than about five minutes. If you do already have ComfyUI installed, remember that support for this model is new, so older versions of ComfyUI won't have the new features and you'll see errors. Here in ComfyUI Manager, at the bottom of the little window, you can see the version of ComfyUI I'm using. If you've got an old version, try clicking the Update All button, and that should keep you up to date. And of course, don't forget to restart when prompted. When opening any workflow in ComfyUI you may get messages about missing nodes, and here Install Missing Custom Nodes is one way to do exactly that for any workflow which says it has missing nodes.

OK, back to the workflows then. You've got a couple of options here: Dev or Schnell. The outputs I'm going to be creating today will be using the Dev version. If we scroll down, we've got some new bits as well, like a simple-to-use fp8 checkpoint. The difference between the two models is that Schnell is great for creating images in just four steps, but Dev can create even better quality images. They do have different licenses,
though, with Dev being non-commercial and Schnell being Apache. Licenses always confuse me, but it seems to say here that outputs are OK for commercial use, so I should be OK including them in this video. All you need to do here is download that one checkpoint file and then drag the workflow into ComfyUI. While fp8 is not as good as the full model they've got here (fp8 degrades the quality a bit, so if you have the resources, use the full official 16-bit version), it is great for those with lower-end hardware.

OK, so let's do that then. I've already downloaded the checkpoint into my models/checkpoints directory, so let's take that workflow, drag it into ComfyUI, and scroll out a bit. Now if I run that, it will of course generate the anime-style image, but you might also have noticed a couple of small changes from the original. First up, you can now use the standard checkpoint loader. There is also a new FluxGuidance node to simulate CFG, and it's using the standard KSampler, though be sure to keep the CFG in that one at 1. Adding new nodes to any workflow is as simple as a double click and then searching for the one you want, and if you're looking for more help on how to customize any workflow, do check out my beginner guide linked in the video description.

OK, now we're up to speed, let's take a look at things like image-to-image and using large language models to help with our image generations. Image-to-image first, then. You might have noticed that I like colours and groups, so I've used lots of them here. There are also a couple more new nodes, such as ModelSamplingFlux and CLIPTextEncodeFlux. ModelSamplingFlux, here in purple, has some values you can change, but these are the defaults and they work well. I've not really got the hang of what they do and why I'd even want to change them yet, but there you go, there are some options to play with. CLIPTextEncodeFlux is a text input node which both splits out the options between CLIP-L and T5-XXL as well as providing that FluxGuidance
option we saw before, built in at the bottom. To start with, I'm prompting for "anime art style person" to change the style of the input image. In order to modify the original text-to-image workflow and make it into an image-to-image version, all you really need to do is input an image latent instead of an empty one, which is exactly what this image group does here. Just above, I've added a Latent Input Switch from Comfyroll Studio. Notice how you can see where the various nodes came from by the little tag at the top, which you can actually set over here in ComfyUI Manager, where it says "badge nickname". The little fox icon just means that a node is built in, and so you don't need any custom node installs to get it. Now, the nickname is generally correct, but node conflicts mean this isn't always the case. Anyway, with the switch set to 1 it will use the empty latent, so that's basically just text-to-image, whereas set to 2 it will be running in image-to-image mode.

Over to the side here I've also got a bunch of nodes for doing maths, and an example case where the node badge isn't correct: the Get Image Size node isn't from YOLO, and I don't even have that installed. In fact, it's from the Stability ComfyUI nodes, but hey, that's where notes come in handy. There's probably a node which does this maths already, but just doing it myself was faster than trying to seek one out. Oh, what's that? Why am I doing all this maths? Well, why would you not just do some random maths in a workflow? Oh, I see, you prefer things that actually do something. OK, well, as you might have guessed already, not all images have good sizes, and like our good friend EmptySD3LatentImage, we want multiples of 16. This image of Bill Clinton here I've cropped down to 2299 by 3000 for this very purpose. Now, first up, a resolution like that is above the 2K limit for getting reasonable images out of Flux. You might get lucky, but for the most part, resolution-wise with Flux you can generally go down to around 512
by 512 and up to 2048 by 2048. And of course, the larger your image, the longer it's going to take to generate, so you'll often want to shrink those down. However, watch what happens if I bypass these maths nodes here and just do a typical 1.25 scale; I can go up there and bypass the group nodes. Cool, the image looks great. But hold on: if we zoom in a little bit, we've got all these strange pixels and things going on around the border. What's going on there? Well, that's because of the resolution, and that's what I'm fixing here with these maths nodes. If we set the group back to Always and run it through again, this time we get a much better image without all those issues. Basically, it's doing the same thing as EmptySD3LatentImage, making sure everything is a multiple of 16, in this case 992 by 1296. Now when we zoom in and take a look, all those issues around the border have gone.

One final and very important note about using image-to-image is the denoise. If you've used Stable Diffusion before, you'll probably be used to much lower denoise values than you need with Flux. For this photo-to-anime example, watch what happens when I set the denoise to 0.75, which is typically quite high. As you can see, it hasn't really gone very anime at all. It is, of course, a little bit more similar to the original, but if you want to change things from photo into anime, you will need a much higher denoise. There we go then, another example on 0.
85 denoise: definitely an anime style, and it's kept the flag in the background, which is quite interesting. You can go the other way too, making realistic images from paintings. Here I've got a different prompt, asking for a cinematic photo style, and an input image that is a painting of an angel. In this case a denoise of 0.75 was actually OK; I find Flux typically tends towards photorealism with image-to-image. You aren't limited to just changing styles, either; you can of course prompt for whatever you want, with the results being loosely based on your chosen picture. So here I've got a denoise of 0.91, so quite high, and I'm asking for an oil painting of a woman standing in front of a beaver. There we've got the input image, and the result looks pretty good; we've got, you know, basically the composition. Now, the denoise there is on 0.
91, and that is about as high as you can go, but we've got the beaver making his dam, and we've even got the text in there as well. As we've seen, even though we don't have things such as ControlNets yet, even a basic image-to-image can give you some degree of composition control.

Now, this is all very well and good, but can we do more? Well, yes, yes we can, but we'll have to invite some AI friends. This workflow is very similar, but instead we're getting an AI to do the prompting for us. Welcome to Florence. Florence is a lovely small AI which looks at images and can tell us stuff about them. It can do things other than captioning: if you look down here, we've got "task"; crack that open and there's all sorts of things, region captioning and segmentation, document Q&A, all sorts of stuff. But today we're just interested in Florence looking at the image and then providing a text description. In this basic form you can essentially avoid having to type anything, as Florence will do that for you via these two simple nodes. Unfortunately, once again the nickname is mistaken here, so for the Florence nodes you should instead install the one I've got here, ID number 267, ComfyUI-Florence2. The first node, Download and Load Florence2 Model, will, as the name suggests, automatically download and load your selected model. The second one, Florence2 Run, has the task set to "more detailed caption", and you can see we've just got the caption output connected. That goes up to the CLIP text encode, forming the T5-XXL input. You can connect it to CLIP-L as well, but I tend to just put random style tags in there myself. And that's it; now it will do the prompting for you. Down here at the bottom, I've also connected a Show Text node, so that is the prompt it's generated: "the image is a painting of a large circular fountain in the center of a mountainous landscape", and loads of other stuff as well. As we saw at the beginning, I think the output is really cool, and that's 0.
85 denoise as well. So we're putting in that original input, like in image-to-image, getting Florence to do the prompting for us, and now we can have loads and loads of different variations of whatever your input image is. Using an empty latent or setting the denoise to 1 will net you an image based on the text alone. This will be very different from your input image, but should also be fairly similar thanks to the long and descriptive prompt.

If you're not having fun yet, how about a party? It's an LLM party; literally, that's where this set of custom nodes came from: ComfyUI LLM Party. I've got some Comfyroll Studio in there too, and a simple text input, as primitives are a bit limited, but it's all nice and straightforward. Now, this can do a whole bunch of things, including automatic random prompting, so hold on to your hat, as this has a couple of bits to it. With LLM Party you can connect to, yes, you've guessed it, a whole bunch of large language models. I like to run everything locally, so I went with the Ollama option. Ollama is really easy to install and has a cute logo: just one command, even on Microsoft Windows. On Linux at least, it installs as a service, so you don't really need to do much else other than pick a model or two. Take a look at the table below the installation instructions; it gives you some great starting examples and sizes of the various models. To download a model, just run "ollama pull" and then the name of the model you want; so, for example, in this command I'm pulling the Llama 3.1 model. There we go. You can also use the same command to update: as you can see, I ran it again and it just compared and fetched only the differences. Whichever model you picked, use that same name in the node, like you can see here, where I've got model name set to llama 3.
1. The node next to it, the API Large Language Model node, is the one that does the magic. The text in this little box is the system prompt; it being an LLM, you can put anything you want in there, so basically I'm attempting to emulate an art critic. I know, right, who'd want to do that? Anyway, much like with Florence, this will also output a prompt, which you can view in the little box next to it. Because you can do so many things with this, I've also added a couple of switches for the prompts. Going in, the text input switch gives you Florence or user input, and going out you've got the LLM-enhanced prompt (the output from this node), the Florence one directly, or the user input. Still with me? We've basically got two LLMs now. Yes, I know there's an image input; I'll probably get around to playing with that at some point, but basically we've got a picture describer and a prompt enhancer. Cool. Now let's see what it can do. Not a bad output; I like how it's got the books in the background, very cool. And of course you can see the prompt down at the bottom here, which is very descriptive. That's on 0.85 denoise, so it's fairly similar to the original. Let's just zoom out so you can see the whole thing. So there you go, an AI-enhanced image-to-image. This time I've increased the denoise to 0.
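As an aside, the resolution maths from the transcript (keep the longest side at or under 2048 and snap both dimensions down to multiples of 16) can be sketched in a few lines of Python. This is my own illustrative version, not the actual maths nodes from the workflow, and the function names are made up:

```python
def snap16(x: int) -> int:
    # Round down to the nearest multiple of 16, as the Flux latents expect
    return max(16, (x // 16) * 16)

def fit_for_flux(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    # Shrink so the longest side fits within max_side (never upscale),
    # keeping the aspect ratio, then snap both dimensions to multiples of 16.
    scale = min(1.0, max_side / max(width, height))
    return snap16(round(width * scale)), snap16(round(height * scale))

# The cropped 2299x3000 example from the video, brought under the 2K limit
print(fit_for_flux(2299, 3000))  # -> (1568, 2048)
```

Feeding sizes like these into the latent avoids the strange border pixels shown in the video.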
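For anyone curious what the Florence-2 "task" dropdown is doing under the hood: the model is steered by special task tokens in its prompt. A rough sketch of that mapping in Python; the token strings follow the Florence-2 model card, but the dropdown labels in the ComfyUI node may be spelled slightly differently, so treat this as an assumption:

```python
# Assumed mapping between caption-style tasks and Florence-2 task tokens;
# the ComfyUI Florence2 node hides these tokens behind its "task" dropdown.
FLORENCE_TASK_TOKENS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "ocr": "<OCR>",
}

def florence_prompt(task: str) -> str:
    # For captioning tasks, Florence-2 is prompted with just the task token;
    # the model then returns the caption text for the supplied image.
    return FLORENCE_TASK_TOKENS[task]

print(florence_prompt("more_detailed_caption"))  # -> <MORE_DETAILED_CAPTION>
```

The "more detailed caption" task used in the video is what produces those long, descriptive prompts for the T5-XXL input.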
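Since Ollama exposes a local HTTP API (on port 11434 by default), the prompt-enhancement step can also be reproduced outside ComfyUI. Here is a minimal sketch of the request an LLM node would effectively make; the system prompt is my own stand-in for the art-critic text in the workflow, and `build_ollama_payload` is just an illustrative helper:

```python
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_ollama_payload(model: str, system: str, prompt: str) -> dict:
    # /api/generate accepts "model", "system", "prompt" and "stream" fields;
    # stream=False asks for one complete JSON response instead of chunks.
    return {"model": model, "system": system, "prompt": prompt, "stream": False}

payload = build_ollama_payload(
    "llama3.1",
    "You are an art critic. Expand the user's caption into a rich, "
    "detailed image-generation prompt.",  # stand-in for the workflow's system prompt
    "a painting of a large circular fountain in a mountainous landscape",
)
print(json.dumps(payload, indent=2))
```

After an `ollama pull llama3.1`, POSTing that JSON to the endpoint (for example with `requests.post(OLLAMA_URL, json=payload)`) returns the enhanced prompt in the `response` field of the reply.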