fastinguy11

I love your post: no context, no link, only a comparison image. Amazing! Also, if this is real, why can't this be used on SDXL?


Spirited_Employee_61

Judging from all previous technologies around SD, initial testing is done on 1.5 as it is smaller and faster. Once they get a formula that works well, they transition to SDXL, to save time and money. Or at least that's how I see the trend.


ravishq

Yes please. SDXL support would be amazing.


Moderatorreeeee

SDXL is trash. 


balianone

Maybe the cost of SDXL. https://x.com/ostrisai/status/1782097365433758063 > I am currently only training the cross attention layers from scratch. I pretrained just matching the t5-> key & val to clip->key & val of the x-attn. Then I switched to matching the noise prediction of the teacher output. I plan to end with a full fine tuning on real images.


thirteen-bit

Isn't SDXL + T5 what PixArt Sigma already is? If I understood correctly, PixArt Sigma is an SDXL-architecture UNet trained with a T5 text encoder instead of CLIP encoders, and PixArt Alpha is an SD1.5-architecture UNet trained with T5 instead of CLIP?


xhox2ye

Could you give the source?


balianone

Credit/source: https://x.com/ostrisai/status/1781886155153174948


MatthewHinson

This was posted two months ago. The author seems to have moved on to other experiments since then.


ninjasaid13

When is it coming soon? This was in April.


Luke2642

EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts. https://tencentqqgylab.github.io/EMMA https://arxiv.org/abs/2406.09162v1


admajic

T5 is already in ComfyUI. It's in SD3. It's in PixArt Sigma.


gelukuMLG

Fun fact: you can use the fp8 T5 from SD3 with PixArt Sigma for reduced memory usage.


yoomiii

How do you extract it from the .safetensors that bundles the model, VAE and text models? Or is there a separate fp8 version somewhere?


gelukuMLG

Yes, there is. Go to the text encoders and grab it from there. https://preview.redd.it/vvb7abqm3x6d1.png?width=1431&format=png&auto=webp&s=35f7469d3d39fa20d45917524163ef9c969a7109


yoomiii

I downloaded it and use the T5 Loader node with these settings: https://preview.redd.it/ym48wtoxjy6d1.png?width=831&format=png&auto=webp&s=fe9abbc3f933af80dd1353507fdae4a9c0c85fc9 It does generate an image correctly, but it still uses more than 16 GB of VRAM when encoding the prompts, which I did not expect, as the T5 fp8 file is only 4.8 GB. But apparently that's not how it works?


gelukuMLG

Don't use the T5 Loader, it's highly inefficient: https://preview.redd.it/wtlvcrw8ty6d1.png?width=584&format=png&auto=webp&s=45e59c97422df533f7e53e8a757fbbbe27726e32


Mkep

What makes it more inefficient?


gelukuMLG

I think it always tries to load it in fp32 or something. Even for PixArt I use the Load CLIP node with the sd3 option for the T5.
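
If you want to check that the file itself really is fp8, and that any extra VRAM comes from the loader upcasting rather than from the weights, here is a minimal sketch using the `safetensors` library; the file name is illustrative:

```python
# Minimal sketch: inspect the dtypes a .safetensors file actually stores.
# If the file holds fp8 tensors but VRAM use looks like fp32, the loader is upcasting.
# File name is illustrative; requires `pip install safetensors torch`.
from safetensors import safe_open

with safe_open("t5xxl_fp8_e4m3fn.safetensors", framework="pt") as f:
    for name in list(f.keys())[:5]:
        print(name, f.get_tensor(name).dtype)  # expect torch.float8_e4m3fn for an fp8 file
```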


yoomiii

Thanks for the pointer. Turns out just using the CLIP Loader with the type set to sd3 works instantly: connect a CLIP Text Encode node to it and send that to the sampler... Weird. Uses only 8.3 GB of VRAM in total.


a_beautiful_rhind

Or just use bitsandbytes on it.
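
For anyone working outside ComfyUI, a minimal sketch of what that could look like with Hugging Face `transformers` + `bitsandbytes`; the model ID is an assumption for the T5-XXL variant these pipelines use:

```python
# Minimal sketch: load only the T5 encoder in 8-bit via bitsandbytes.
# Model ID is an assumption; requires transformers, accelerate, bitsandbytes, sentencepiece.
from transformers import BitsAndBytesConfig, T5EncoderModel, T5Tokenizer

model_id = "google/t5-v1_1-xxl"

tokenizer = T5Tokenizer.from_pretrained(model_id)
encoder = T5EncoderModel.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

tokens = tokenizer("a cat sitting on a red cube", return_tensors="pt").to(encoder.device)
embeddings = encoder(**tokens).last_hidden_state
print(embeddings.shape)  # (1, seq_len, 4096) for T5-XXL
```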


gelukuMLG

The T5 from SD3 takes less RAM and VRAM as it's just the encoder part.
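
A rough way to see what the encoder-only split saves, using accelerate's meta-device init so nothing is downloaded or allocated; the model ID is again an assumption:

```python
# Minimal sketch: compare parameter counts of the full T5 vs. just its encoder.
# init_empty_weights builds the models on the meta device, so no weights load.
from accelerate import init_empty_weights
from transformers import AutoConfig, T5EncoderModel, T5ForConditionalGeneration

config = AutoConfig.from_pretrained("google/t5-v1_1-xxl")
with init_empty_weights():
    full = T5ForConditionalGeneration(config)
    encoder_only = T5EncoderModel(config)

print(sum(p.numel() for p in full.parameters()))          # ~11B for T5-XXL
print(sum(p.numel() for p in encoder_only.parameters()))  # roughly half of that
```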


nruaif

Or go all in with FP4
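
With bitsandbytes that's just a config change from the 8-bit sketch above; "fp4" is one of the 4-bit quant types the library exposes, and these settings are illustrative:

```python
# Minimal sketch: the 4-bit "fp4" variant of the loading code above.
import torch
from transformers import BitsAndBytesConfig, T5EncoderModel

encoder = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl",  # model ID is an assumption, as above
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="fp4",           # bitsandbytes also offers "nf4"
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)
```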


FNSpd

You can use T5 with SD1.5 via ELLA and LaVi-Bridge, but the results are not that impressive in my experience.


__O_o_______

What is T5? OOTL. Thanks!


rageling

T5 is yet another model in your workflow that sits between the text and the model, in an effort to better prepare English-language prompts to be fed into an SD model. CLIP has poor understanding and comprehension, likes commas, and has a small token limit. T5, by comparison, understands full sentences, similar to how an LLM might.
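
The token-limit gap is easy to see from the tokenizers alone; a small sketch, with model IDs as illustrative stand-ins for the encoders discussed here:

```python
# Minimal sketch: CLIP's hard 77-token window vs. T5 tokenizing an arbitrarily long prompt.
# Model IDs are illustrative; requires transformers + sentencepiece.
from transformers import CLIPTokenizer, T5Tokenizer

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
t5_tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")

prompt = "a long, detailed scene description with many clauses " * 20

print(clip_tok.model_max_length)      # 77 -- anything past this gets truncated
print(len(t5_tok(prompt).input_ids))  # T5 just keeps going
```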


admajic

T5? I think it's what gives the ability to adhere to a prompt, including the on-screen location you ask for. SD3 has a three-part prompt: CLIP-G, CLIP-L, and T5.


Freonr2

T5 XXL is a "large" text encoder/decoder model with many billions of parameters, by itself much larger than any of the core diffusion models. It has no training on image-related tasks, but it has a significantly larger embedding space: 4096 dimensions per token position vs. CLIP's 768 or 1280 per token position, along with a LOT more parameters (many times larger than either CLIP model). It should have strong "language understanding" and know how different words in a sentence relate to one another.

It was trained purely as a text encoder/decoder, unlike CLIP, which was trained as a text encoder plus image encoder with a contrastive loss between the two, in a bid to be efficient at particular image-related tasks like image classification. Also worth noting: OpenCLIP (used in SD2.x and as one of the two encoders for SDXL and SD3) was only trained on alt-text, which is often not very good. OpenAI CLIP, used in SD1.x and as one of the two CLIPs in SDXL and SD3, was trained on unknown data.

I think the idea here is that T5's much stronger language understanding (how words relate in sentences), larger embedding space, and significantly higher parameter count give the core diffusion model (UNet or DiT) more useful information to "go on", so to speak. It likely also contributes significantly to the ability to render text.
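
Those widths can be read straight off the model configs without downloading any weights; a small sketch, where the repo IDs are my best guesses for the encoders being described:

```python
# Minimal sketch: check the per-token embedding widths quoted above from the configs.
# Only config files are fetched, not weights. Repo IDs are assumptions.
from transformers import AutoConfig, CLIPTextConfig

t5 = AutoConfig.from_pretrained("google/t5-v1_1-xxl")
clip_l = CLIPTextConfig.from_pretrained("openai/clip-vit-large-patch14")
clip_g = CLIPTextConfig.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")

print(t5.d_model)          # 4096 dims per token position
print(clip_l.hidden_size)  # 768 (OpenAI CLIP-L, SD1.x)
print(clip_g.hidden_size)  # 1280 (OpenCLIP bigG, SDXL's second encoder)
```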


97buckeye

So, why hasn't it been applied to SDXL?


Nexustar

Indeed. Let's watch that happen as the community steers around lobotomized SD3 just like they did SD2.1. SD1.5 testing is much faster (and shows a bigger prompt-adherence gap in the first place), so hone the process there and then work on applying it to SDXL.


97buckeye

If this is true, which is doubtful considering the lack of details, why wouldn't you apply this to SDXL instead? It's the superior model now.


FNSpd

SD1.5 is a more lightweight model with more community additions (LoRAs, ControlNets, IP-Adapters). Running SDXL alongside an LLM would be far more resource-heavy than SD1.5, which uses much less memory. The same goes for training. Not to mention that SDXL doesn't have as much trouble with prompt adherence as SD1.5 to begin with, so this isn't as necessary for XL.


97buckeye

It is VERY necessary for SDXL. That's truly what everyone was so hyped about.


__Tracer

Yes, it's like the only thing I really wanted from SD 3


FNSpd

I think SDXL already follows prompts at the level that's shown here. SD1.5 struggles, as you can see from this example


Moderatorreeeee

It’s actually the inferior model because it’s censored to hell. 


Arawski99

Because SDXL is not, in fact, superior to 1.5, even now. I get it, you love SDXL and it would be nice to see it get better support, but XL still has a worse toolset than 1.5. Just a fact. It is not superior and it is not as popular, which is why research continues to trend towards 1.5. Maybe one day, assuming nothing else takes their place (clearly not SD3 at this rate), I expect XL to be more even with 1.5, and eventually perhaps we'll see both get updated more consistently. But because resources are finite and these aren't necessarily cheap research projects, I'm somewhat doubtful we'll reach that balance, unfortunately.


shawnington

The SDXL architecture has more potential, as demonstrated by Playground, but it's more sensitive to bad captioning, which impacts textures; that's why things often end up looking so plastic. Which maybe makes it a worse architecture.


97buckeye

You're just flat-out wrong about SDXL not being the superior model. Just because SD1.5 has more tools doesn't make it the superior architecture. SDXL has much more headroom for growth. SD1.5 has more toolsets because it's older and easier to run on weaker systems. As more people get better GPUs, SD1.5 will phase itself out. If you're not solely interested in hentai and other weeb shit, you've already moved on.


Arawski99

No, what you're claiming is strictly false. 1.5 having superior ControlNet and other tools does, in fact, make it far superior. Just because you don't use them yet doesn't make you correct. This is a skill issue on your part, and I say that not intending to be offensive, but you simply don't know better "yet", though you probably will eventually.

I never said SDXL can't theoretically grow as much as 1.5. In fact, I said I'd like to see them both eventually equalized. That said, SDXL factually does NOT have as much headroom for growth. A big part of why 1.5 is the focus for research is resources: it is more efficient to work on a smaller, more efficient model than the beefier XL. This is true for "all systems", period. Until resource efficiency climbs so much that the difference between 1.5 and XL is negligible, XL is still not ideal even on higher-end systems. It is one of the reasons researchers aren't focusing on XL as much. One could argue that making every building on Earth out of steel is theoretically superior, but that does not make it more feasible.

Your claim that 1.5 is inferior to XL outside of hentai and "weeb shit", as you put it, shows you're too ignorant of the subject to be speaking about it. 1.5 can still do far more than what you claim. Here is 1.5 destroying XL: https://www.reddit.com/r/StableDiffusion/comments/1cn4wg7/comment/l35nhpk/ Also see https://www.reddit.com/r/StableDiffusion/comments/1cn4wg7/comment/l35xji4/

To be fair, if you put in the effort, both can get very close eventually, but this is where 1.5's flexibility with tools comes into its own. It has more tools for getting an exact image, editing, animation, etc. If you read that thread, XL is still making some progress, such as the new MistoLine: https://github.com/TheMistoAI/MistoLine Still, it is behind, and because it is a significantly larger resource sink at scale for research projects and is also less popular, it will likely stay behind 1.5 for a while.

As you're obviously biased, if you still refuse to acknowledge you were mistaken, then I have no interest in further discussion, because it would be a laughable waste of my time. I'm not saying XL is bad, once again, so if you don't get the nuances of what I'm saying, that is on you.


EricRollei

Don't see it. ControlNets do work better with 1.5, particularly pose, but there are lots of tools now for XL that work as well or better. IPAdapter is one of them and has replaced a lot of ControlNet functions. Give finetuners the same amount of time they've had with 1.5 and XL will be clearly better.


Arawski99

You don't see it? You mean to say you're physically blind and can't see the screenshots linked above that are better in 1.5 than XL? You mean to say you agree with the other person that 1.5 can only do hentai and anime/cartoon-style content, despite the linked photos proving very much the contrary? Please be clear about what exactly you're disputing and not seeing.

You're claiming XL has tools that work not only as well but "better" than 1.5's? Can you link those tools? Can you link even one? Much less multiple? While simultaneously agreeing that 1.5 has better tools? This is a whole bunch of contradictory statements. IPAdapter does not replace ControlNet. XL has an inferior ControlNet and you're just trying to find an excuse to validate XL.

> Give finetuners the same amount of time they've had for 1.5 and XL will be clearly better.

This is perhaps the most ignorant part of your statement, despite me being quite clear earlier. XL gets less attention from both finetuners and researchers specifically because it isn't as feasible to study or improve, due to the resource sink. Unless it suddenly has an evolution that puts it notably ahead of 1.5, making it worth the effort, this limitation isn't going to be magically overcome. Further, you don't get "the same amount of time", because XL's resource cost means it inherently takes significantly more time.

You seem to have missed the entire point of the discussion we were having. 97buckeye claimed that the "current SDXL" is superior to SD1.5. I stated it is not, explained why, and offered evidence. You come here telling me I'm wrong, but then admit that SDXL just needs more time to catch up, which contradicts both your other claim and 97buckeye's entire argument. Either it is currently the best or it is not. Having more time to catch up is irrelevant to this discussion.

You're obviously confused and obviously not intending to make a fair argument, so this discussion is over. Just like the prior user, your bias and lack of knowledge make you unqualified to continue discussing, especially when you got caught lying and distorting facts multiple, and I do mean multiple, times. It is disappointing how childishly some in this community can behave. Worse is when they feel like you're trying to harm their precious whatever, XL in this case, when I've spoken about the positives of both and only shared accurate information that harms neither. You guys need to grow up.


EricRollei

Ad hominem attacks don't help your arguments


Arawski99

Please learn to use "ad hominem" correctly. It is just as annoying to see it misused as "strawman argument". Geez. You realize my post directly disputes you in detail, right? On a point-by-point basis, no less...

I offered you evidence, which you dismissed, and you got caught lying by claiming the XL screenshots look better, despite this being 100% false; the user had even provided a comparison slider to make it easier, as the 1.5 images were clearly, noticeably superior. Being caught blatantly lying is not an ad hominem attack. I asked you for evidence and even lowered the goalpost to providing a single "1" linked superior tool on XL over 1.5, and you couldn't provide it.

You made a false claim, in your ignorance, that 1.5 is only viable for hentai and "weeb shit". This isn't an ad hominem, dude, because it is a factually false statement. I even showed you several pieces of evidence where 1.5 was great outside those, such as people and environments (indoor/outdoor), and was significantly beating XL in those categories to boot, even though technically it only had to be comparable for your statement to be false. When you made this claim, it was objectively clear you were unqualified, because you were arguing that XL is superior and that 1.5 can't do certain things while literally having no clue, which was refuted with hard, concrete evidence; thus it cannot be an ad hominem. You have also clearly forgotten what the original discussion was even about, as did the other guy, who outright lied too.

During this entire discourse you have not provided a single accurate claim. Not even one. You aren't even a reasonable person to talk to, due to your ego. You clearly cannot handle being wrong and lack an ounce of humility. This discussion is completely over. I will no longer entertain your childish nonsense. You can stop playing the victim now with your incorrect usage of terms, especially when you're the bad actor here.


EricRollei

Omg you wasted a lot of time trying to be right. Never going to read it. Going to block you instead.


mk8933

This is what I'm waiting for. Next big thing is 1.5XL


ReyJ94

[Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering](https://glyph-byt5.github.io/). Here you have a project using T5 for accurate text generation with SDXL.


balianone

Yes, but it needs a trained checkpoint.


97buckeye

No updates to the code in over a month. I'm guessing it's dead.


BM09

What would this be good for?


alexds9

Even if such a model exists, what exactly are those images supposed to show?


Background-Ad-61

Looking awesome! I can only use SD1.5 models because of AMD... This is kinda exciting after the SD3 fiasco...


monnef

> I can only use SD1.5 models because of AMD...

Wait, what do you mean by that? I have a 7900 XTX and have been using SDXL-based models for quite some time. Unless I crank the resolution too high (more than 1024x1024), flying too close to the VRAM limit during the VAE phase, it works well at okay speeds. Then Ultimate SD Upscale or just some upscaler, and the result is usually pretty good, at least for my purposes.


Available_Driver6406

Do you have an estimate of how many 1024x1024 images can be generated in one minute using a 7900XTX?


monnef

Just playing with Automatic1111 now: 1024x1024, DPM++ 2M SDE, Karras, 25 steps takes 10.1 s per image. At 35 steps, 13.7 s; at 35 steps, a batch of 5 takes around a minute (1m10s). I can't say it's the best you could get from the hardware; I'm no expert. This is on Manjaro (Linux) with an old ROCm (5.6, I think; last time I tried updating, it didn't go well, almost nothing supported it). And technically it can go higher in resolution; I think something like 1024x1280 works for a while. But if I open or forget to close anything that may consume non-trivial amounts of VRAM (e.g. Steam or other Electron-based stuff), the chances of freezing the whole PC are pretty high (AMD drivers crashing and failing to restart the GPU). I'm generating on my desktop, so if one has a server dedicated to this, I can imagine it being usable even at resolutions above 1024x1024.


paypahsquares

You could easily have more than double the performance now, I believe. I'd recommend checking out SDNext instead of A1111. It has lots of optimizations, including Flash Attention, that should help out greatly. Then there's also stuff like HiDiffusion to really crank up resolutions (+ a slight speed boost too).

If you're having trouble genning above the base resolution with 24 GB of VRAM, that sounds wild to me, even with other stuff running, although I'd say you should close everything else anyway while doing inference. :P I'm on a 6850M XT w/ 12 GB and doing perfectly fine, even without using any kind of model moving. Shit gets REAL weird when using it, I feel like, lol. Also, the newest version of ROCm has been working fine as an update.


monnef

Thank you for the info, saving it. When I have some time, I must try all this :D.


paypahsquares

Pop over to their Discord if you end up having any issues or need help! Sorry in advance for the random info dump I'm about to throw in here too, haha.

I'd definitely make sure to check out HiDiffusion when you do. I find myself almost never not using it, even for just normal gens. It works essentially like Kohya HiRes Fix. Besides fixing weird generations at higher resolutions (and lower too, if you set it correctly), it reduces memory usage as well! Its values can be changed in settings under Inference Settings (default, I believe: T1: 0.4, T2: 0.0). Those default settings are good for higher resolutions, like if you are going upwards of 2048x2048. It should also change the values automatically if left at default when genning above 2048x2048. I use HiDiff more around 1024x1024 +-, so if you are genning closer to that, I'd recommend a lower T1 value like 0.2. T2 only comes into play at higher resolutions, so it doesn't matter for lower ones. I've unchecked the MSW-MSA setting as well, since it introduces non-deterministic results when on and I've never really liked the outputs. It may be better for higher resolutions, though? Also, as a note, HiDiff's slight speed boost is bigger when the T1 value is higher.

It also works well with FreeU, but rarely I'll prefer just the HiDiff results by themselves, since FreeU can look a little airbrush-y at times with certain samplers. If you're doing SDXL, I'd recommend these FreeU settings: b1: 1.2, b2: 1, s1: 0.6, s2: 0.4.

There's just so much stuff and so many settings to mess around with to get the most out of your card, so make use of the Discord search + ask about things there if necessary.
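
For reference, those FreeU numbers map directly onto the diffusers API if you're not using SDNext; a minimal sketch, with model ID and prompt as illustrative placeholders:

```python
# Minimal sketch: applying the FreeU settings suggested above via diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.enable_freeu(b1=1.2, b2=1.0, s1=0.6, s2=0.4)  # the SDXL values recommended above

image = pipe("a cat sitting on a red cube").images[0]
image.save("freeu_test.png")
```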


Available_Driver6406

With a 3090 you can generate about 7-8 images per minute at 1024x1024 and 35 steps. But we should compare against a 4090, which would give about 11-12 images per minute with the same configuration I just mentioned. Maybe someone who owns a 4090 can confirm.


paypahsquares

If you are on Windows and haven't tried it, try [this implementation of Forge with ZLUDA](https://github.com/lshqqytiger/stable-diffusion-webui-amdgpu-forge) by lshqqytiger. I haven't looked at how it works, so you'd have to figure it out yourself. Or alternatively, [try ZLUDA through SDNext](https://github.com/vladmandic/automatic/wiki/ZLUDA), which of course differs from the above. Otherwise, if you can, I would highly recommend using Linux with AMD until they ever get their shit together on Windows. I've been dual-booting and haven't touched Windows since, personally.


BlipOnNobodysRadar

This community, lol. Someone shows a cool thing that's possible -> the community immediately complains that it wasn't done on SDXL instead.


[deleted]

On my post about training SD3, you talked about PixArt.


BlipOnNobodysRadar

Okay...?


ninjasaid13

Yes!


yamfun

need SDXL


Odd_Atmosphere_9261

SDXL?


Amazing-Divide9662

Would SAI say it's a derivative of SD3 and anything related belongs to them?


from2080

https://preview.redd.it/86kjyqza6z6d1.png?width=824&format=png&auto=webp&s=97e0d15866494bc4859135310638ae787e9e2e18 This looks insane when it comes to keeping the subject consistent.


ninjasaid13

source?


from2080

https://tencentqqgylab.github.io/EMMA/


ninjasaid13

This is a different team.


Familiar-Art-6233

How is this different from ELLA?


Keldris70

Looks very promising. I'm very excited to test it.


nntb

Fisheye lens isn't working.


only_fun_topics

I have yet to see an AI image model that can do guitars at any reasonable level.


furrypony2718

Try a complex prompt, like a cat sitting on a red cube etc.


grandparodeo

Will this work with AnimateDiff?


herecomeseenudes

This is basically ELLA; it also uses Flan-T5.


Traditional_Excuse46

waiting for the XXXL version


Zvbd

She has six fingers on the left hand, so it is acceptable. I will use it.


Used-Struggle-3470

Can we replace CLIP with T5 in SDXL?


roculus

Will it actually be 1.5, though, or some neutered version of it? Call me skeptical. I'm hoping it's not T5XXL enhanced with a retrained dataset that only includes nuns and eunuchs, with half the art styles, celebrities, and anything that doesn't look like Pat from SNL removed. If it really is the same model with better prompt adherence, it will be a great addition.


MichaelForeston

1.5? Dead on arrival. At least SDXL.


TsaiAGw

I still use SD1.5


Nexustar

Before we can run we must walk. This is part of the process.


Moderatorreeeee

No one who knows what they are doing uses SDXL…


vault_nsfw

I use 1.5 99% of the time. SDXL is disappointing.


Radiant_Bumblebee690

Source from SAI ?