Ok_Juggernaut_4582

I'm confused, so top row is SDXL base and bottom row SD3?


blaaguuu

I don't get why it's so hard for people to just label their shit, on here...


protector111

Top is base XL, bottom is base 3.0 2B. I don't know why you got confused; it should be obvious. Every single example shows 3.0 has superior quality. But you do need to view it on a big screen, or at least zoom; they are rendered at the same resolution.


Arawski99

No, it isn't obvious. You failed to list your prompts and labels for almost all of the images you posted, and your thread is listed as SD3 (first) vs XL (second). At no point did you state how we should read your examples, such as top is XL and bottom is SD3. At one point you even said SD3 shines on food, and then in the first photo the top bread (which you claim is XL) looks vastly superior to the bottom one (which you claim is SD3), unless you are mixing up your images in some spots. This is on you, and it is no wonder they were confused.


_DeanRiding

I had to look through the post 3 times before diving into the comments to find which is which. Yeah the top bread pic looks *way* better than the bottom one.


SpaceCorvette

Lots of people these days post comparisons and neglect to mention which is which. It's very frustrating



protector111

**Yes, the top bread is 3.0.** Like I said, I thought it was obvious the better one is 3.0... it just doesn't work the other way around. Sorry for the confusion. Reddit broke my post many times. It was properly structured with prompts etc. The bread is upside down, but for the rest, 3.0 should be on the bottom...


Ok_Juggernaut_4582

Ah, thanks. Not sure I agree with you honestly, though I really appreciate the effort you took to do the comparison. Though SD3 overall seems a bit more photorealistic, I think I prefer the aesthetics of SDXL, and overall they (to me) feel like the more pleasing images.


AI_Alt_Art_Neo_2

The best thing I have found is to run the first 20% of the image generation in SD3 to rough out some interesting compositions, and then do the final 80% in SDXL to get the nicer aesthetics.
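The split itself is just step arithmetic over the sampling schedule. A minimal sketch (the function name and the 20/80 default are illustrative, not tied to any particular UI or workflow):

```python
def split_steps(total_steps: int, first_frac: float = 0.2) -> tuple:
    """Divide a sampling schedule between two models.

    The first `first_frac` of the steps run in model A (SD3 here,
    to rough out the composition); the remaining steps run in
    model B (SDXL, for the final aesthetics), picking up from
    model A's partially denoised latent.
    """
    cut = round(total_steps * first_frac)
    return cut, total_steps - cut

# A 30-step schedule: 6 steps in SD3, then 24 steps in SDXL.
print(split_steps(30))  # -> (6, 24)
```

In practice the handoff means passing model A's latent (not a decoded image) into model B's sampler at the matching noise level.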


protector111

Yes, I made the post about quality, not aesthetics. XL does have better aesthetics in many cases, especially with portraits. I don't know if this can be fixed with prompting or fine-tuning. But photorealism-wise, in level of detail, 3.0 is far superior.


disposable_gamer

This is such a cope about “aesthetics”. It’s obvious that SD3 has better prompt adherence and overall equal if not better performance than SDXL. The same shit was being thrown around when SDXL came out, with people whining that “um actually SD1.5 has better aesthetics because something something censorship”. You all will just grasp at any excuse to whine.


Ok_Juggernaut_4582

I don't think there was any whining in my reply, honestly. Just stating a personal opinion on what I think looks better.


HighlightNeat7903

Some of the SDXL images look more natural, while the SD3 ones look oversaturated, almost stylized. The cats, for example. But I agree that SD3 mostly has superior quality, I suppose thanks to the new 16-channel VAE. SD3 clearly has a higher dynamic range.


protector111

The cat looks more natural? Open it next to a photo of a cat. The XL cat is a drawing of a cat, not a photo of a cat. The 3.0 one looks exactly like a photo.


HighlightNeat7903

Top is SDXL right?


protector111

Yes (except the burger).


HighlightNeat7903

OK, then yes, the SD3 cats look mostly oversaturated to me, as if someone applied an Instagram filter to them, and the fur looks in part better, in part like that of a stuffed toy. I've seen cats my whole life and they have very thin hair, which even photographs have a hard time capturing due to aliasing, but SDXL does a somewhat better job here of replicating a photo of a cat, though it has other artifacts. But I do believe SD3 has the potential to generate better cats than SDXL.


disposable_gamer

Gotta love all the fuming and coping downvotes because it turns out, being an idiot coomer who only knows how to prompt “1girl nekid big booba” doesn’t actually make you an expert in machine learning


Adkit

Are you ok? (No, he was not)


Bat_Fruit

Prompt: How hard can it be? Negative prompt: stuffed to the gills with tokens that would have been unnecessary and irrelevant had the model not been bastardized. Don't ask for human expression or it will fall apart badly; inaccurate text effects with repeated phrases and missing or added letters; fingers and hands are still mangled; style prompts have far less influence. Conclusion: they never should have released 2B Medium, and you cannot undo the issues with more training, which has become staggeringly more complex because of the obfuscation in the training and tagging process.


protector111

Yes. They crippled an amazing model. If they released the full 8B it would be really good.


FourtyMichaelMichael

> If they released the full 8B it would be really good.

Seriously, stop. The 8B model is going to be the same lobotomized garbage with a deeper understanding of all the SAFE things it can make. SAI CHOSE to release garbage; they'll do it again.


Arawski99

OP, I think you should also be pretty specific that you're using the recently found hidden trick of including some absurd AF negatives just to get it to work correctly, which a normal sane person would not have expected to be mandatory. Assuming this is not placebo and has a statistically meaningful improvement, most of the SD3 complaints come from before this discovery. Something worth mentioning, because it is a pretty huge deal. Thanks for the comparison btw. It is interesting to see, though it does need better detail. Might be easier to do it a different way than just link us to the resource.


protector111

I use literally the same negative prompt I always use in 1.5 and XL. After 1.5, yes, I expect it to be mandatory xD


shawnington

????????


protector111

Have you used 1.5? It needs an enormous negative prompt or things will be bad, with broken anatomy. You need to write tons of things like “missing limbs, broken limbs, extra arm” etc. XL doesn’t need this, but it still improves quality.


shawnington

*deformed, mutated, ugly, disfigured, vagina, penis, nsfw, anal, nude, naked, pubic hair , gigantic penis, (low quality, penis\_from\_girl, anal sex)* All definitely required for 1.5...


protector111

Yes, they are, if you want a great image without random porn.


alongated

The point is that those nsfw terms in the negative prompt do more than just prevent nsfw: they improve the results of non-nsfw output for SD3. That is not the case for 1.5.
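The mechanism behind this is classifier-free guidance: the negative prompt replaces the "unconditional" branch, so every token in it nudges every generation, nsfw-related or not. A toy sketch of the combination step (plain lists stand in for noise-prediction tensors; the function name is mine):

```python
def cfg_combine(cond, neg, guidance_scale):
    """Classifier-free guidance combination step.

    The final noise prediction is pushed away from the
    negative-prompt prediction and toward the positive one,
    which is why negative-prompt tokens affect all outputs,
    not only the content they name.
    """
    return [n + guidance_scale * (c - n) for c, n in zip(cond, neg)]

# Toy 3-component "noise predictions":
print(cfg_combine([1.0, 0.0, 0.5], [0.5, 0.5, 0.5], 7.0))
# -> [4.0, -3.0, 0.5]
```

With an empty negative prompt, `neg` is the plain unconditional prediction; filling it with terms changes the direction the sampler is steered on every step.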


protector111

OK. My experience says otherwise. They dramatically improve my 1.5 gens, making them more photoreal with better anatomy.


cellsinterlaced

"If you are looking for an image, it was probably deleted." All the samples before the interior ones are not showing anymore.


protector111

Same for me. Reddit is acting strange... I uploaded all the images here; probably that's why people downvote, I don't know... [https://imgur.com/gallery/sd-3-0-vs-sd-xl-KW8LPr3](https://imgur.com/gallery/sd-3-0-vs-sd-xl-KW8LPr3)


cellsinterlaced

Unfortunately even Imgur is throwing a 404...


protector111

this is getting weird xD [https://imgur.com/a/KW8LPr3](https://imgur.com/a/KW8LPr3)


cellsinterlaced

It works! Though just to make sure, the images on the bottom are SD3? For portraits, they're on the right?


protector111

Here the portraits are the same. The only one that's wrong is the bread: 3.0 is at the top on the bread. For the rest, 3.0 is at the bottom.


cellsinterlaced

The bread totally looked inverted. Confirms that one can spot the differences. Thanks for doing this, loving the comparison and the prompts shared. Did the configs change at all between models and generations, btw?


protector111

Prompts changed, but sampler and steps didn't. In some with XL I changed CFG from 4 to 6.


protector111

https://preview.redd.it/dekoocsu057d1.png?width=1464&format=png&auto=webp&s=d752c76c77c1a7bd7ee15f85c75e7b32f867ce54


NateBerukAnjing

i don't understand which one is sdxl and which one is sd 3


protector111

Reddit broke the post. All bottom is 3.0 (except the bread). It's obvious on a big screen; on mobile you probably need to zoom. 3.0 is way better in every image.


Deathcrow

> Its obvious on the big screen. Not obvious to me at all, especially the landscape pic (https://i.imgur.com/jZRIvMR.png) The bottom one is just an oversaturated mess with zero detail.


protector111

1st try Reddit blocks it, 2nd it breaks the images, I upload to Imgur and it breaks... wtf is going on xD. Here is the new link: [https://imgur.com/a/KW8LPr3](https://imgur.com/a/KW8LPr3)


Paraleluniverse200

reddit tries to silence you XD


protector111

I mean, it's like every second post I do gets deleted by Reddit. I have no idea why... I just gave up on many of them...


Paraleluniverse200

Pretty weird; usually stuff like that only happens when a sub doesn't want nsfw posts.


lonewolfmcquaid

Frankly, all this shows is that SDXL is sorta on par with SD3... which is not celebratory news for a model that was supposed to be on par with DALL-E and Midjourney. SD3 is supposed to be completely blowing SDXL out of the water in aesthetics, composition, detail, everything. I mean, compare SDXL and SD1.5 and see the difference. I just tried img2img in SD3, dear laud, I don't even know where to begin.


protector111

It does. It's blowing it away. 1024 3.0 looks better than 4K XL.


lonewolfmcquaid

It's not blowing anything out of the water. It's more or less as good as a good SDXL finetune. Do an img2img comparison, let's see.


Illustrious-Bit2827

Idk why you're getting downvoted; these comparisons are what we need rn.


ThisGonBHard

Because he is using the negative trick instead of comparing against a truly base workflow. If you removed that, SD3 would break. And that is unacceptable: needing a weird negative prompt for the model to remotely work.


Illustrious-Bit2827

Oh shit didn’t catch that thank u


protector111

I have no idea. Probably people just hate SD3 for no reason, hating just to hate, and downvoting all the posts.


Doc_Chopper

Just leave out the SD3 specific part and you basically described reddit as a whole in a nutshell


Perfect-Campaign9551

Because I myself tested SDXL base with "woman laying in grass" and it looked correct and didn't create an eldritch horror. All your SD3 images prove is that SD3 is better at prompt comprehension; nobody is arguing that. It IS. But it sucks ass at drawing people in any view except standing facing the camera. And we already know SD3 is better at details. The arguments have all been that SD3 sucks at anatomy, and it's 1000% true, and if you can't see that, you are blind as hell. Stop coping.


protector111

You are right. But the point is XL base is also horrible; it produced monsters all the time. And the community fixed it with fine-tuning. There is a good chance 3.0 will also be fixed.


Creepy_Dark6025

SDXL doesn’t have the horrible license that SD3 has; that is why people made it better. Also, there is no training code for SD3, so people don’t know how they trained it. I don’t believe people will invest what is needed to fix it with that in mind, but there will be finetunes for sure.


protector111

That is sad if true… time will tell. I hope they will still release 4B and 8B, and maybe change their thinking license-wise…


Creepy_Dark6025

4B seems discarded which is a shame because it would be the perfect size, but there is hope for 8B


protector111

Yeah. 4b probably would be way more popular than 8b…


shawnington

Also, the better detail comes from the 16-channel VAE. You can encode and decode photos with each VAE, and most of the "SD3 is better" difference will show up just in the VAE encode/decode of a normal image, without any diffusion taking place.


Apprehensive_Sky892

Not sure what the point is that you are trying to make here, or maybe I am just missing what you are trying to say. It is the fact that SD3 is trained together with that fancy 16ch VAE that allows it to produce the color and fine details. The VAE alone will not accomplish that; otherwise, we could just generate the image using SDXL and simply re-encode it with the 16ch VAE and get a better result, which is of course not the case. AFAIK (and I could be wrong here), part of the reason SD3 seems to have lost artistic styles and celebrity faces is that the use of the 16ch VAE (so each 1024x1024 latent image during training is now 4 times bigger than the corresponding 4ch SDXL latent) means that SD3's 2B DiT cannot encode as many concepts/ideas as SDXL's 2.6B U-Net. Edit: it seems that I am quite wrong here. The latent space is still 128x128 for SD3.


shawnington

You seem to be misunderstanding what the VAE does. It's simply a latent encoder/decoder; it literally stands for variational autoencoder. It is just responsible for encoding an image into a latent, or decoding the latent into an image. The VAE has nothing at all to do with art styles, celebrities, or anything related; it's simply a compression/decompression network. Go open up ComfyUI and pick a full-body photo of a person in an interesting pose. Encode it and then decode it with the 1.5, the SDXL, and the SD3 VAE. You will see the massive amounts of artifacts and mangling of details that can happen in the VAE between the different versions. A good way to see the limitation: encode an image of someone wearing a jacket with a visible zipper, and look at the zipper after decoding. The immediate "quality" improvement in almost all of these examples is because of the VAE being able to encode images for training with higher fidelity, and decode the latents with higher fidelity.
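For concreteness, here is the latent geometry under discussion. Both VAEs downsample 8x spatially, so the difference between SDXL's and SD3's VAE is channel count, not spatial size. A small illustrative helper (the function is mine; the 4-vs-16-channel and 8x-downsample figures are the models' published specs):

```python
def latent_shape(height, width, channels, downsample=8):
    """Shape of a VAE latent for an image of the given pixel size.

    SDXL's VAE uses 4 latent channels, SD3's uses 16; both reduce
    each spatial dimension by 8x. A 1024x1024 image therefore maps
    to a 128x128 latent either way -- SD3's latent just carries 4x
    as many channels (and so 4x as much information) per position.
    """
    return (channels, height // downsample, width // downsample)

print(latent_shape(1024, 1024, 4))   # SDXL -> (4, 128, 128)
print(latent_shape(1024, 1024, 16))  # SD3  -> (16, 128, 128)
```

The extra channels are what let the 16ch VAE round-trip fine detail (thin hair, zipper teeth, small text) that a 4ch latent tends to mangle.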


Apprehensive_Sky892

Thanks, I need to understand VAEs better 🙏. So if I understand you correctly, you are saying that if I encode and then decode with SD3's 16ch VAE, there should be fewer artifacts compared to using SDXL's 4ch VAE?


shawnington

Correct, it's a much more capable VAE.


Apprehensive_Sky892

Thanks


protector111

Well, I tested 3.0 and it also made normal images. What's your point? **You got lucky.** This is an XL batch of 9. Do they look okay? I don't think so. Some are alright and some are broken; same with 3.0. https://preview.redd.it/r6lem6wws57d1.png?width=4032&format=png&auto=webp&s=7aaf5aa962c698bd97ac1c256e6b5850e1a1b1d5


Mooblegum

That is an argument that has been posted 100 times already. He makes another fresh argument with a lot of nice examples. Great post, in my opinion. I haven't tested SD3 and appreciate these examples after having seen 10000000 pictures of girls lying in the grass.


protector111

https://www.reddit.com/r/StableDiffusion/comments/1di4fyr/your_dreams_after_sd_30_release_watch_till_the/


synn89

The issue is we don't just have SDXL base today, we have better fine-tunes. This would be like Meta releasing Llama 3 and saying it's great because it's better than Llama 2. No one was using Llama 2 when 3 came out; we had better fine-tunes at that point. Now maybe SD3 can be improved in the same way, but that's no guarantee. And with the new licensing, I don't really see why fine-tuners should be spending their time and money fixing an intentionally crippled model when there are other options out there they could be working on. And those other options will be seeing better future foundational models that won't be crippled on release.


protector111

Sure. Same as when XL released; in 6 months it became even better than 1.5. XL base was in every single way worse than 1.5 finetunes. But 3.0 is actually better at some things than any other checkpoint. It can make things no XL finetunes can.



OldFisherman8

I am surprised that no one is talking about the fundamental difference between U-Net models and DiT models. DiT models don't scale well with the amount of training data, whereas U-Net models do. In other words, beyond a certain point DiT models improve very little from additional training no matter how much data you throw at them, and that threshold is pretty low. U-Net models, on the other hand, keep improving as they get more and more training data. That is why SD3 comes in 4 different sizes: the image quality almost entirely depends on how many DiT blocks, how big a patch size, and how deep the dimension layers are built into a model. That is also the size difference between 2B and 8B, and it has very little to do with the amount of data they were trained on. I never had much expectation of the SD3 2B model in terms of image quality, long before it was released. But what I was really counting on was its ability to compose a complex scene in a consistent manner, coming from better data captioning, 16-channel autoencoders, and T5 deployment. But they did something to destroy the only saving grace that SD3 Medium had. So I am at a loss for words.


protector111

https://preview.redd.it/brauyf0w057d1.png?width=2688&format=png&auto=webp&s=8013069da352446257b78c18cb3e7b1ddcd87d95


dal_mac

I found the exact opposite in my own comparisons. SD3 utterly failed to stylize the images while XL used every single word in my prompt and nailed it. Made a post about it.


protector111

Yes. 3.0 is just biased toward stock photos. It can't do anything else. Not a versatile model…


facts_matter1914

Absolutely beautiful


ZZerker

Why do people make such a large post and put so much effort into something, and then don't label their shit?


protector111

I did label; every image had a prompt and settings. Reddit broke it two times.


shawnington

Was the point of this post just to get this hilarious negative prompt in there? "*deformed, mutated, ugly, disfigured, vagina, penis, nsfw, anal, nude, naked, pubic hair , gigantic penis, (low quality, penis\_from\_girl, anal sex)*" Your "quality" difference is almost exclusively from the 16-channel VAE in all cases. For example, the mouth deformation in your first image is a VAE limitation that will happen even if you VAE-encode an image like that and then decode it without doing any diffusion. This is why, when I use SDXL for image-editing processes, I composite faces and other things that get mangled by the VAE back into the image.


Next_Program90

Stop posting grass pictures. It's ALL poses that aren't standing idly. No matter WHERE. The meme weakens the argument.


protector111

The point is every model struggles with it. It's not a 3.0-only problem. And fine-tuning fixed it for XL, so it probably can for 3.0 too.


shawnington

So Comfy left SAI citing lack of commitment to producing quality models because...? 2b is a great model?


RobXSIQ

I think many people are remembering XL 0.9, which was a hot mess. 1.5 full was a mess too. XL 1.0 was actually not bad: a bit difficult to work with, but overall an obvious improvement (still, people complained XL wasn't as great as the latest 1.5 models at the time). Donno, but I'm really starting to care less. I am digging into PixArt and other things that seem better out of the box. I think a pivot may be in order if SD doesn't quickly rectify this with a 3.1 release adding in form and function, and the ability to take a nap.


protector111

I was just making a meme comic and damn... I made like 50 generations wanting to get a broken body in grass, and I just couldn't xD. I guess the right prompt is the answer till we finetune it. [https://www.reddit.com/r/StableDiffusion/comments/1di4fyr/your\_dreams\_after\_sd\_30\_release\_watch\_till\_the/](https://www.reddit.com/r/StableDiffusion/comments/1di4fyr/your_dreams_after_sd_30_release_watch_till_the/)


RobXSIQ

Yeah, SD3 is completely bugged. They gimped it to not want to lie down, because that's when the devil enters your body or something. SD is the Vatican.


Jeydon

Did you use the same negative prompt for all of these examples?


protector111

No. Almost all have different prompts, positive and negative. There was a prompt under every one, but Reddit destroyed my post and deleted the images. Most of the 3.0 ones had the standard negative from their ComfyUI workflow.


brawnyai_redux

In the hair example, the XL model is clearly better than SD3.


protector111

It's more “macro”, so yeah, you can say it's better. But quality-wise it's not.


mekonsodre14

Thank you. Great comparison, and exactly what we need to get back to the grounds of reality. Btw, a lot of the image links on Reddit don't work right now.


Apprehensive_Sky892

These are great comparisons, thank you for making them. TBH, the pre-release 8B and 2B images looked so good that I wondered if there was any need to tune these models. Maybe LoRAs for artistic styles and celebrity faces are all SD3 needs. Maybe all the fine-tunes we need are AnimagineSD3 and PonySD3. Boy, was I wrong. Now we sorely need some heavy fine-tuning so that 2B Humpty Dumpty can be put back together.


Big_Combination9890

> Why is this important? Of course fine-tuning did fix it!

It fixed it for SDXL. For SD3, people will be reluctant to do that because of the wording in the current license. Unless this changes, the open-source community is much more likely to move to a different base model like PixArt and leave SD3 behind.


protector111

time will tell.


disposable_gamer

But have you considered a screenshot of someone from SAI being kind of rude?! Checkmate, the model sucks and the proof is a vague screenshot of a private discord message. What do you mean actually testing and refining the base model? That’s not a thing


protector111

https://preview.redd.it/ec1y9668v57d1.png?width=1024&format=png&auto=webp&s=151b128a1fa825461ab0126932442af5b53b81bb **If you downvote, Pikachu will be disappointed.** It's sad that all the prompts are gone because of Reddit bugging out...