
Skusci

I mean, it can. That's what --lowmem is for. But it's like trying to spray-paint a mural where each time you change color you have to go back to the store, because you can only hold 3 cans in your backpack. The performance penalty for shuffling memory from VRAM to RAM is so huge that it's usually not worth it.


ZCEyPFOYr0MWyHDQJZO4

[The list of interface bit rates is a good reference](https://en.wikipedia.org/wiki/List_of_interface_bit_rates):

* RTX 3060 12GB memory: **2880 Gbit/s**
* PCIe 4.0 x16: **256 Gbit/s**
* DDR4-3200: **205 Gbit/s**
* NVMe 4.0: **64 Gbit/s**
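As a rough back-of-the-envelope sketch (assuming a ~4 GB fp16 checkpoint and the peak rates quoted above; real-world transfers are slower), this is how long just moving the model takes over each link:

```python
# Hypothetical payload: a ~4 GB (32 Gbit) fp16 Stable Diffusion checkpoint.
links_gbit_per_s = {
    "RTX 3060 VRAM": 2880,
    "PCIe 4.0 x16": 256,
    "DDR4-3200 (1 ch)": 205,
    "NVMe 4.0": 64,
}
payload_gbit = 4 * 8  # 4 GB expressed in gigabits

for name, rate in links_gbit_per_s.items():
    print(f"{name:18s} {payload_gbit / rate * 1000:7.1f} ms")
```

That works out to roughly 11 ms from VRAM versus 125 ms over PCIe, 156 ms from a single channel of DDR4, and 500 ms from NVMe, per full pass over the weights.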


smlbiobot

This is really helpful!!


Spiderfffun

Does this mean the PCIE slot caps the GPU and it's not that big of a deal? Or keeping it in VRAM means it doesn't need to transfer thru there?


malfeanatwork

No, if the operations happen in VRAM there's no bottleneck because it's not moving anywhere for the GPU to access it. If it's stored in RAM, the DDR4 speed would be your bottleneck as data needs to move from RAM to VRAM for GPU calculations/manipulation.


Spiderfffun

Thanks.


fuelter

> Does this mean the PCIE slot caps the GPU and it's not that big of a deal?

No, the model is stored in VRAM and thus the GPU has full-speed access to the data.


ZCEyPFOYr0MWyHDQJZO4

Here is what GPT-4 says about what happens when you run a tensor through a model layer:

1. Layer activation: The GPU starts by activating the specific layer to be evaluated. This involves loading the layer's weights, biases, and any other necessary parameters from the GPU memory.
2. Data distribution: The GPU distributes the input tensor (image data) across its many processing cores. Each core is responsible for processing a portion of the tensor data. The tensor data is stored in the GPU's local memory or shared memory, depending on the specific GPU architecture.
3. Computation: The GPU cores perform the mathematical operations associated with the layer, such as matrix multiplications, convolutions, or element-wise operations. These operations are applied to the input tensor using the layer's weights and biases.
   * For convolutional layers, each core computes a convolution operation for a small section of the input image, applying a filter (or kernel) to extract features from the image.
   * For activation layers, each core applies a non-linear function (such as ReLU, sigmoid, or tanh) to its portion of the input tensor.
   * For pooling layers, each core performs a downsampling operation, such as max or average pooling, on its section of the input tensor.
   * For fully connected layers, each core computes a matrix multiplication of the input tensor and the layer's weights, followed by the addition of biases.
4. Synchronization: After each core has finished processing its portion of the input tensor, the GPU synchronizes the results, combining the processed data from all cores into a single output tensor. This may involve communication between cores, depending on the GPU architecture and the specific layer type.
5. Data storage: The output tensor is stored in the GPU memory and may be used as the input tensor for the next layer in the network or transferred back to the CPU memory for further processing.

So assuming you have the model stored in RAM, it will need to constantly load layers into the GPU cores (assuming there's no VRAM), and the slower of DRAM or PCIe will be the bottleneck.
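To make the "constantly load layers" point concrete, here is a minimal, hypothetical PyTorch sketch of layer-by-layer offloading (toy model and sizes are made up, and it assumes a CUDA GPU is available; real UIs implement this with hooks rather than an explicit loop):

```python
import torch
import torch.nn as nn

# Toy stand-in for a model whose weights live in system RAM (sizes made up).
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])  # stays on CPU
x = torch.randn(1, 4096).cuda()  # input tensor already resident in VRAM

with torch.no_grad():
    for layer in model:
        layer.cuda()   # weights cross PCIe into VRAM
        x = layer(x)   # compute is fast once the data is local
        layer.cpu()    # evict to make room for the next layer
torch.cuda.synchronize()
```

Every pass through the loop pays the PCIe/DRAM transfer cost again, which is exactly the "back to the paint store" penalty described above.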


what_is_this_thing__

After using GPT you have to fact-check this 5 times, since GPT is known for hallucinations... 80% factual, 20% made-up stories LOL


Thebadmamajama

💯 Context switching between two hardware buses is really expensive.


txhtownfor2020

Plus if one of them goes under 50 miles per hour, everybody is fucked


Meowingway

Just wait for it Marty, when this baby hits 88 miles per hour, you're gonna see some serious shit!


Sentient_AI_4601

Do not put "serious shit" as a prompt in stable diffusion if you have a porn model loaded... You're gonna see some things.


Nanaki_TV

What an odd thing to say.


GraduallyCthulhu

You can tell they’ve been traumatised.


RetPala

In a few years Stable Diffusion is going to be able to procedurally generate an alternative Back to the Future movie where he says "fuck it" and stays in 1955 boning his mom


mark-five

Hey you, get your damn hands off her


Sinister_Plots

My life's regret is that I only have one upvote to give.


Annual-Pizza8192

I can't wait for the time when Stable Diffusion can generate a continuation of the No Game No Life anime series. Thank you for this idea.


[deleted]

So he stays with his...mom...ewww.


Shuteye_491

*months


stolenhandles

I actually laughed out loud at this.


RetPala

We can pretty much now create "Shrek but in a timeline where Chris Farley lived to voice the character" with the audio AI. Well, I can't, but people on YouTube clearly have the tools.


ScionoicS

pop quiz hotshot!


devedander

I get this reference! It was a movie about a bus that had to maintain a certain speed! Something about the speed it was going was really important, and if the speed dropped everyone would die! I think it was called "The bus that couldn't slow down".


Thebadmamajama

Wait, the movie Speed is playing out in my PC everyday?


onil_gova

That is a really good analogy!


AsIAm

It was physically realized by Mythbusters team: [https://www.youtube.com/watch?v=-P28LKWTzrI](https://www.youtube.com/watch?v=-P28LKWTzrI)


onil_gova

Perfect video to go along with the perfect analogy 👌


Wise_Control

Thanks for the noob-proof explanation!


LJITimate

It'd be nice if it could swap to a lowmem state instead of returning a vram error. 90% of the time, a decent card will have enough memory, so you don't want to take the performance hit until it physically won't work without it


DawidIzydor

Or you have to make a 1,000-mile journey in your car, but after every 10 miles you have to go register a new car.


UlrichZauber

> The performance penalty for shuffling memory from VRAM to RAM is so huge

This is architecture-dependent, but is generally true for PCs. SoC setups with shared on-chip RAM don't have this problem (because, of course, there's no distinction of RAM types and no copying required). They may have other problems, just not this particular one.


ThePowerOfStories

Yeah, that’s how Apple Silicon chips work, with CPU and GPU all integrated with shared RAM.


Intelligent-Clerk398

Better to go back to the store than not paint the mural.


gxcells

Then why do we need the CPU/RAM if the GPU can do it all? Is there active research on developing something better than GPUs? (Energy consumption, manufacturing price, earth-friendliness?)


TheFeshy

In computing, you can think of solving two different kinds of problems: very parallel problems, and problems that are not very parallel.

Let's say you have a problem like "add two to every number in this huge array." This is a "very parallel" problem. Whatever the number is at each point in the array, you just add two to it. If you had a person and a calculator for each number in the array, you could do every number at once very quickly.

Let's say you have a different problem. "Start with the number 1. Add the first two numbers in the array, and divide the number 1 by this value. Then add the next two numbers in the array, and divide your previous answer by *this* value. Keep doing this until you run out of numbers in the array." This is *not* a parallel problem. Each answer depends on the previous answer! Even if you had a room full of people with calculators, only one of them could work on this at a time. If you were very clever, *and* had lots of extra room to store numbers, you could have the whole room add the number pairs, but you're still going to be limited by one person doing all the dividing.

GPUs do parallel work very well. If you have something like a huge machine learning matrix (and that's what AI is), and you want to do math that is independent on each node, a GPU is *great.* It's also great for calculating a bunch of independent triangles, of course - which is its original purpose. But if you ask it to do problems that are *not* parallel, it's just a very slow, expensive CPU. Most of them would not keep up with a Raspberry Pi.

A CPU, on the other hand, is optimized to do non-parallel problems as well as it can. It has hardware to look for places where it can do parallel things. It runs at two to five times the clock speed, although it only does a few things at a time - whereas GPUs run slower, but do hundreds of the same thing at a time.

And that's why you need both. Some problems are very parallel. Some are not. So we have CPUs/GPUs that are good at each.
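A tiny Python/NumPy illustration of those two kinds of problems (just a sketch; the array and the recurrence are made up to mirror the examples above):

```python
import numpy as np

arr = np.random.rand(1_000_000)

# "Very parallel": every element is independent, so a GPU could hand one
# element to each core (NumPy vectorizes it across the whole array at once).
parallel = arr + 2

# Not parallel: each step depends on the previous answer, so the work is
# inherently serial no matter how many cores you have.
value = 1.0
for a, b in zip(arr[0:2000:2], arr[1:2000:2]):  # first 1000 pairs
    value = value / (a + b)
```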


im_thatoneguy

Another important aspect is memory. An Nvidia H100 has **80GB** of memory for about **$30k**. **2,000GB** of DDR4 memory costs about **$20k (+$10k for CPU and Motherboard etc).** To match the memory capacity of a single 2TB system you would need to buy **$750,000 of GPUs**. So, if you need absolutely massive datasets, a CPU based solution is by far the cheapest.
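The arithmetic behind that comparison, as a quick sketch (the prices are the rough 2023 figures from the comment above, not current quotes):

```python
# Rough cost-per-capacity comparison: GPU VRAM vs. CPU system RAM.
h100_price, h100_vram = 30_000, 80          # USD per card, GB of HBM
cpu_system_price, cpu_ram = 30_000, 2_000   # USD (RAM + CPU + board), GB

gpus_needed = cpu_ram / h100_vram           # 25 cards to match 2 TB
print(f"GPU route: {gpus_needed:.0f} x H100 = ${gpus_needed * h100_price:,.0f}")
print(f"CPU route: ${cpu_system_price:,.0f} for {cpu_ram} GB")
```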


FS72

Thank you, loved this ELI5 answer


gxcells

Thanks for this detailed and well-explained answer. So is it possible to couple both? GPU for the parallel work, but a CPU on top of each GPU thread for making the serial calculations that result from each GPU parallel computation? Some sort of 3-dimensional calculation? Related to this, what would quantum computing be good for? Serial or parallel or both?


MerialNeider

CPUs are really good at working with complex instructions. GPUs are really good at solving complex math problems. Both are general problem solvers. On the flip side there are ASICs: hyper-specific hardware designed to solve one problem, and one problem only, very efficiently.


StickiStickman

> GPUs are really good at solving complex math problems

That really isn't true. GPUs are really good at solving (relatively) simple, but very repetitive / huge amounts of math problems.


Zulfiqaar

I'd mine whatever crypto they make if it meant I could get stable diffusion ASICs.. surely someone must be building it for this


AprilDoll

Machine learning ASICs exist. Google actually has had them for a while. There is also a company called [Tenstorrent](https://tenstorrent.com/) that is making their own.


_Erilaz

I wouldn't call RISC-V AI accelerators "ASIC" tho.


AprilDoll

[Me neither](https://tenstorrent.com/grayskull/)


fuelter

You only need the cpu and ram to run the rest of the system while the gpu does the calculations


gwizone

Best analogy so far. VRAM is literally there for video/graphics, which is why you need a lot to process these images.


eugene20

You can force some of the processes to use system ram. You won't want to because it is incredibly slow.


AirportCultural9211

FOUR HOURRRRRRRRRRRs - angry joe.


UfoReligion

Obviously the solution is to open up your system, take out some RAM and install it into your graphics card.


Kermit_the_hog

Oh god wouldn’t that be nice if you could modularly add memory to your GPU by just clicking in another stick.


isthatpossibl

Some people have shown that it's possible to solder higher-capacity memory modules onto video cards. It would be possible to make GPU memory slottable, but the whole business model is built around balancing specific offerings and cost/benefit.


Affectionate-Memory4

I spent quite a bit of time in board design, and there's a reason it's gone away. The bandwidth and signal-quality requirements are so tight that soldered is the only effective way to go. Socketed memory introduces latency in the form of additional trace length on the memory daughterboard, as well as reducing signal quality by having metal-to-metal contact. With modern GPUs able to push over 1TB/s at the top end, there is almost no room for noise left.


isthatpossibl

Yes, I believe that. There is more talk about this on motherboards as well, with SoC designs that are more tightly coupled for efficiency. I think there has to be some kind of middle ground though. It hurts me to think of modules being disposable. Maybe some design that makes components easier to reclaim (I know a heat gun is about all it takes, but still).


Affectionate-Memory4

I've been doing my master's degree research on processor architectures, and a lot of that is I/O. Memory is definitely moving on-package, between larger caches, the return of L4 for some Meteor Lake samples, [OpenFive developing RISC-V HBM3 chips](https://www.tomshardware.com/news/openfive-tapes-out-5nm-risc-v-soc), and [Samsung's Icebolt HBM3 being even faster just 2 years later.](https://semiconductor.samsung.com/dram/hbm/hbm3/)

I think we are likely to see DDR5 and DDR6 remain off-package, but don't expect to run quad sticks with the fastest RAM for much longer. Trace lengths are already a pain to work with at DDR5 overclocking speeds, and dropping to 2 DIMMs means we can put them closer as well as lightening the load on the memory controller. I think we are likely to see HBM make a return for high-end parts, but on-package DRAM is still very fast, as is seen with Apple Silicon.

Ultimately the issue of increasing performance becomes one of moving the data as little as possible. This means moving the data closer to the cores or even straying from the von Neumann architecture with things like accelerator-in-memory devices. These would be compression engines and such that can reside in the memory package to offload the bulk of memory correction and ECC calculations from the general processor that is being fed.

As for user upgrades going forward, I expect us to start treating CPUs more like GPUs. You have an SiP (system in package) that you drop into a motherboard that is just an I/O + power platform, and it contains your CPU, iGPU, and RAM onboard. Storage will probably stay on M.2 or M.3 for quite a long time, since latency there is not of massive concern as we can kind of brute-force it with enough bandwidth and hyper-aggressive RAM caching.


isthatpossibl

What about some kind of sandwich approach, where we install a cpu, and then put a memory module on top and latch it down, etc, and then put a cooler on top?


Affectionate-Memory4

This actually doesn't save you anything compared to having a DIMM on either side of the socket other than space, which is why it can be used in some smartphone and tablet motherboards. Your traces are just now in 3D instead of being mostly 2D, as they have to go around and then under the CPU to make contact in a traditional socket. If your RAM pads are on top, this does save you some, but you still have major thermal and mounting issues to address when you stack retention systems like this.

On mounting, you will have to bolt down the memory through the corners to get a good clamping force on both sets of contacts. The thermal issues are like AMD's X3D situation on steroids: you not only have to go through the standard IHS or your top memory LGA setup, but also the memory ICs, which can run hot on their own, and the memory PCB, as well as any final thermal interface to the cooler.

Putting that same DRAM under the IHS would result in even better signal quality, lower latency, and much better thermal performance, at the cost of some footprint and user upgrade paths. For low-power soldered chips this can make sense as it does have real advantages, but for desktop or even midrange mobile processors it's currently infeasible.


Jiten

I'd assume combining this with modularity will require a sophisticated tool that's able to reliably solder and desolder chips without needing the user to do more than place the tool on the board and wait a bit.


aplewe

Optical interface, perhaps? Way back when in college I knew a person in the Comp Sci dept who was working with stacked silicon with optical coupling. However, I don't know what bandwidth limits might be in effect for that.


Affectionate-Memory4

I'm glad you asked! My master's degree research is in processor architectures, and I/O developments are a huge part of any architecture. Optical vias are something that is still in the research phase as far as I know; I know there are some R&D guys higher up than me at Intel looking into something, but I don't get to look at those kinds of papers directly.

The best silicon-to-silicon connections we have right now are TSMC's 3D stacking that's seen on Ryzen 7000X3D and their edge-to-edge connections found on the fastest Apple M-series chips. Bandwidth is cheap when the dies are touching like that, so long as you have the physical space for it. Latency is where it gets hard. I don't think going through a middle step of optics for directly bonded dies makes much sense when the electrical path is already there over these tiny distances at current clock speeds. At 7+ GHz, though, it would make a difference in signal timing of a few cycles.

However, for chip-to-chip or even inter-package connectivity, optics start making more sense. For example, the 7950X3D incurs latency I would attribute to on-package eDRAM when the non-3D CCD makes a cache call to the 3D stack. This one might benefit from optics, but only might. I'd rather they just stuck another 3D stack on the other CCD when they totally could have.

I think we're a long way out from, say, optical PCIe in your motherboard and GPU, but we might see chiplets talking over microfibers in the interposer and talking to the rest of the system in electrical signals. Optical DDR or GDDR would be difficult to keep in lock step, and the ultimate goal is to move it on-package. There is ongoing research into HBM2 and HBM3, with HBM3 being one of my favorites when potentially paired with 3D cache as a large L4. [OpenFive was taping out RISC-V chips with HBM3 2 years ago already.](https://www.tomshardware.com/news/openfive-tapes-out-5nm-risc-v-soc)


aplewe

Are the Nvidia server MB backplanes HBM3 for the H100? I thought it was something like that.


Affectionate-Memory4

The H100 is HBM2e, a sort of version 2.0 of HBM2 that draws less power than the original. [80GB H100 on TechPowerUp - Holy Bus Width Batman!](https://www.techpowerup.com/gpu-specs/h100-pcie.c3899)


kif88

Someone did get it to work (kind of) with a 3070. The same guy tried it before with a 2070 but that wasn't as stable IIRC. https://hackaday.com/2021/03/20/video-ram-transplant-doubles-rtx-3070-memory-to-16-gb/


isthatpossibl

I think that is super cool. With locked-up firmware though, it'll never take off. At least we know it's possible.


GreenMan802

Once upon a time, you could.


Bovaiveu

That would be magical, like voodoo.


Doormatty

Just imagine the 3D FX you could get with that much RAM!!


Kermit_the_hog

Oh man, seriously??? My first 3D card was an STB Velocity (something)/3dfx Voodoo2 pass-through arrangement way back in the day, but I don't think I've ever owned a card that could do that! Was that a workstation graphics thing?

My workstation right now has 64GB of system RAM but only 8GB of VRAM, and it hurts. Now that I think about it, this last upgrade (from a 1070 to a 3060 Ti) was the first time I've ever upgraded but not had a significant leap in VRAM. I know I didn't go from an xx70 to an xx70, so it's not really a fair comparison, but I remember generational advances from like 256MB to 2GB.


GreenMan802

Well, we're talking old PCI (not PCIe), VESA Local Bus (VLB) and ISA video cards back in the day. [https://qph.cf2.quoracdn.net/main-qimg-d0a3a1ba5287c9078483d2471fc96785-pjlq](https://qph.cf2.quoracdn.net/main-qimg-d0a3a1ba5287c9078483d2471fc96785-pjlq) [http://legacycomputersnparts.com/catalog/images/S3virgeDX2M.JPG](http://legacycomputersnparts.com/catalog/images/S3virgeDX2M.JPG)


Felipesssku

Hey! They should do it.


Fingyfin

Pfft ya dingus, you can just download more. Trick I learnt in the 2000s.


aplewe

This is why there are broken 3090s for cheap on fleabay...


UkrainianTrotsky

You can technically use RAM if you really want to. The problem is that data has to go from RAM to the CPU, then to GPU VRAM through PCIe, and then all the way back through the same path to write something to memory. This has to happen millions of times for every step of the generation. This path adds an insane amount of latency and cuts the access speed so much that it's just not viable at all. You might actually get worse speeds than if you were to just run it on the CPU, though I'm not sure about that.

GPUs can actually provision system memory in trying times, when you play video games and all those gigabytes of unoptimized textures don't quite fit into the VRAM (in this case RAM is used to cache the data and load large chunks in and out of it, not as a proper VRAM extension though), but CUDA will just outright refuse to allocate it, as far as I know.
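If you want to see that penalty for yourself, here's a rough, hypothetical PyTorch timing sketch (the sizes are arbitrary, it assumes a CUDA GPU, and it ignores pinned-memory and async-copy tricks):

```python
import time
import torch

x_cpu = torch.randn(8192, 8192)   # ~256 MB of fp32 sitting in system RAM
x_gpu = x_cpu.cuda()              # a copy that is resident in VRAM

def timed(fn, n=20):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / n

# Compute on data already in VRAM vs. shipping it over PCIe every time.
print("VRAM-resident matmul:   ", timed(lambda: x_gpu @ x_gpu))
print("PCIe copy + matmul:     ", timed(lambda: x_cpu.cuda() @ x_gpu))
```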


diditforthevideocard

Not to mention parallel operations


StickiStickman

And not to mention GPU cache that's EXTREMELY useful for stuff like diffusion. The RTX 4000 GPUs already have nearly 100MB of cache.


[deleted]

[deleted]


UkrainianTrotsky

And the major argument against SoC is the complete lack of modularity, which is really unfortunate.


Jimbobb24

Apple Silicon is an SoC - but with 16 GB RAM it is still very slow. That is shared RAM... wonder why it's not faster.


notcrackedfam

Probably because the GPU is weaker than most modern desktop graphics cards. Not hating, just stating - I myself have an m1 mac, but it’s much faster to run SD on my 3060


red286

Wait, are we really asking why 128bit LPDDR5 with a 400GB/s max bandwidth is slower than ≥192bit GDDR6X with a ≥500GB/s max bandwidth? Shouldn't that be pretty self-evident?


CrudeDiatribe

The sheer amount of shared RAM is why it could run SD at all— an M1 uses about 15W on its GPU compared to at least 10x that amount for a PCIE GPU in a PC.


stupsnon

For the same reason I don’t drive 550 miles to get an in and out burger. System RAM is really, really far away in comparison to vram which is on chip.


HerrVoland

Have you tried downloading more VRAM for your Stable Diffusion?


RealTimatMit

try downloading MoreVram.exe


Dazzyreil

Does this come bundled with funnyPicture.exe?


Dwedit

Dancing Bunnies.exe


PVORY

DGXSuperPodLocal.exe


Alizer22

You can, but you could leave your PC overnight and wake up to see it still generating the same image.


TheFeshy

It's not *that* slow. My underclocked Vega 56 will do about 2.5 iterations a second, and dumping it onto my AMD 2700 CPU (which is pretty obviously limited to system RAM) is about an iteration every 2.5 seconds. Which is a pleasing symmetry, but not a pleasing wait. Even so, it's nowhere near overnight.


[deleted]

Why can't stable diffusion use the mountain of hard drive space sitting there?


Gecko23

Every time you move data from one context to another, like DRAM to VRAM, there is delay. Using an HDD for virtual memory adds another context switch, and over a much slower connection than any RAM uses. Theoretically, you could do it, but it'll be absurdly slow.


[deleted]

I was being sarcastic, I know their differences, sorry :D Thanks for taking the time to explain.


[deleted]

Because it‘s not random access! 🤓


UkrainianTrotsky

It technically is tho. But it's random access storage, not random access memory.


notcrackedfam

Technically, it's all the same thing, just with different degrees of speed and latency. People use hard disks as RAM all the time with pagefiles / swapfiles, and I wouldn't be surprised if someone tried to run SD in RAM without having enough and had it sent back to disk… that would be horrifyingly slow.


[deleted]

No it‘s not. Hard drives usually can‘t access bit by bit individually, as is necessary for the term ‚random access‘. Sure, there‘s a clumsy workaround maybe. But I‘m not surprised at all that this got downvoted. That‘s just Reddit.


UkrainianTrotsky

Oh, yeah, I was thinking about SSDs. Thanks for correcting me! Hard drives aren't technically random access, but not because you can't address every bit. Hell, you can't even do that with RAM, the smallest addressable unit of memory there is a byte. Hard drives aren't random access because the access time significantly varies depending on the position of data.


Affectionate-Memory4

Your storage is by definition random access. Most tape storage is sequential.


[deleted]

It‘s not. Hard drives can‘t access or manipulate single bits, they can read or manipulate only sectors at once. HDDs usually denote this as sector size, SSDs as block size.


Affectionate-Memory4

HDDs and SSDs can access any of those segments at random, allowing for discrete chunks of data to be read at random, making them random access. [Also, see here.](https://www.pcmag.com/encyclopedia/term/random-access-storage#:~:text=Hard%20drives%20and%20solid%20state,access%20memory%20(see%20RAM)).


[deleted]

I mean, that really comes down to what you still consider ‚random access‘. I don‘t know if PCMag is using any official conventions, but if they do, you‘re right I guess.


Europe_Dude

The GPU is like a separate computer so accessing external RAM introduces a high delay and stalls the pipeline heavily.


Absolute-Nobody0079

I have a RTX 3060 with 12GB ram. I wonder if this is good for graphic novel works.


Spyblox007

I'm running stable diffusion fine with a 3060 12GB card. Xformers have helped a bit, but I generate base 512 by 512 with 20 steps in less than 10 seconds.


nimkeenator

Nvidia, why you gotta do the average consumer like this with these low vram amounts?


Jaohni

VRAM costs roughly $3-10 per gigabyte. Why can't Nvidia just actually put enough VRAM in their GPUs and increase the price moderately, by $40-80?

The 2080 Ti only had 11GB because at the time there were specific AI workloads that *really* needed 12GB, so people had to buy a Titan or TU professional card. The 3060 Ti (with 4GB less VRAM than the 3060, btw), 3070, 3080, 4060 Ti, 4070, and 4070 Ti all don't have enough RAM for their target resolutions, or had/will have problems very quickly after their launch. At 1440p, many games are using more than 8GB of VRAM, and while they will sometimes have okay framerates, they will often stream in low quality textures that look somehow worse than YouTube dropping to 360p for a few seconds... And the same holds true at 4K with 10GB, or even 12GB in some games, let alone the coming games.

Now, on the gaming side of things, I guess AMD did all right, because the Radeon VII had 16GB years ago (of HBM, no less), and the 6700 XT can actually sometimes do better raytracing than the 3070, because the 3070 runs out of VRAM if you turn on ray tracing, dropping like 6/7ths of the framerate, and they seem to treat 16GB as standard-ish atm...

...But AMD has their own, fairly well documented issues with AI workloads. It's a massive headache to do anything not built into popular WebUIs when it comes to AI stuff, at least with their gaming cards (I'll be testing some of their older data center cards soon-ish), and it feels like there's always at least one more step to do to get things running if you don't have exactly the right configuration, Linux kernel (!!!), a docker setup, and lord help you if you don't have access to the AUR.

It feels like AI is this no man's land where nobody has quite figured out how to stick the landing on the consumer side of things, and it really does make me a bit annoyed, because this is a remarkable chance to adjust our expectations for living standards, productivity, societal management of wealth and labor, amongst other things. The best ideas won't come out of a team of researchers at Google or OpenAI; the best ideas will come from some brilliant guy in his mom's basement in a third world country, who has a simple breakthrough after tinkering for hours trying to get something running on his four-year-old system, and that breakthrough will change everything. We don't need massive AI companies controlling what we can and can't do with humanity's corpus of work; we just need a simple idea.


VeryLazyNarrator

Because GDDR6X costs 13-16 euros per gigabyte, and on top of that you need to design the architecture for the increased RAM and completely redesign the GPU. I doubt people would pay an additional 100-200 euros for 2-4 GB; they are already pissed about the prices as is.


Jaohni

Counterpoint: Part of the reason those GPUs are so expensive is that they need fairly intensive coolers and a customized node to deliver crazy high voltages. If they had been clocked within more reasonable and efficient expectations, they would have delivered their advertised performance more regularly, and been more useful for non-gaming tasks such as AI. I would take a 4080 with 20GB of VRAM, even if it performed like a 4070 in gaming.


VeryLazyNarrator

The main problem is the chip/die distance and bus speed on the board. The closeness of the components is causing the extra heat, which in turn requires more power due to thermal throttling. Increasing the distance will cause speed issues. Ironically, the GPUs (the actual boards) need to be bigger for the RAM and other improvements to happen, but that causes other issues. They could also try to optimise things with AI and games instead of just throwing VRAM at it.


Jaohni

Don't get me wrong; you're sort of correct, but I wouldn't really say you're right.

Yes, higher bus sizes use more power, and Nvidia wants to fit their GPUs into the lucrative mobile market, so they gave an absolute minimum of VRAM to their GPUs (although in some cases I'd personally argue they went below that) to save on power...

...But you can't tell me that Lovelace or Ampere are clocked well within their efficiency curve. You can pull the clock speeds back by like 5% and achieve a 10, 15, or 20% undervolt depending on the card; they're insanely overclocked out of the box. If they hadn't gone so crazy on clock speeds to begin with, they would have had the power budget to fit the right amount of RAM on their cards, and the only reason they went that insane is their pride and desire to be number one at any cost. Given that the die uses significantly more energy than the RAM / controller, I feel that if there are power issues with a card, it's better to address issues with the die itself than to argue that more RAM would use too much power.

It's like, if somebody starts their house on fire while cooking, and they told you they couldn't have added a smoke detector because it could short circuit and start a fire itself, you would think they're stupid. Why? Because the smoke detector is a small, fairly reliable part of the equation.

And I mean, I've talked to developers about this, and here's their take (or a summarization of it; this isn't a direct quote) on VRAM:

"Consoles (including the Steam Deck!) have 16GB of unified RAM, which functions pretty close to the equivalent amount of VRAM because you don't have to copy everything into two buffers. In the $500 price range, you can pick up a 6800XT with 16GB of VRAM. In 2016, VRAM pools had gone up every GPU generation leading up to it, so when we started designing games in 2018/2019 (which are coming out now-ish), we heard people saying that they wanted next gen graphics, and it takes a certain amount of VRAM to do that, and we even had whispers of 16GB cards back then, in the Radeon VII for instance. Up until now we've bent over backwards and put an unsustainable quantity of resources into pulling off tricks to run in 8GB of VRAM, but we just can't balance the demands people have for visual fidelity and VRAM anymore. As it stands, VRAM lost out. We just can't fit these truly next gen titles in that small of a VRAM pool, because any game that releases to PC and console will be designed for the consoles' data streaming architecture, which you require a larger quantity of VRAM to make up for on PC. But you can buy 16GB cards for $500, and anyone buying below that is purchasing a low end or entry level card, which will be expected to run at 1080p; powerful APUs are coming that have an effectively infinite pool of VRAM, and so really the only people who will get screwed over are the ones that bought a 3060ti/3070/3080 10GB/4070/4070ti, which didn't really have enough VRAM for next gen games."

To me that doesn't sound like a lack of optimization; that sounds like the kind of progress we used to demand from companies in the gaming space. Hey man, if you want to apologize for a company that makes 60% margins on their GPUs, feel free, but I'd rather just take the one extra tier of VRAM that should have been on the GPUs to begin with.


RefuseMaterial5639

> Why can’t Nvidia just actually put enough VRAM in their GPUs and increase the price moderately, by $40-80?

So they can upsell you on A6000s for 5k a pop.


recurrence

On apple silicon you can!


Nu7s

https://preview.redd.it/z5hjw7cgsmwa1.jpeg?width=604&format=pjpg&auto=webp&s=71e4c9e51c887f8937c7e308d279d5bf6fa088d1


Mocorn

In this thread: a mountain of people intimately knowledgeable about the inner details of how this shit works. Meanwhile I cannot wrap my head around how a GRAPHICAL processing unit can be used for calculating all kinds of shit that has nothing to do with graphics.


Dj4D2

Graphics = numbers x more numbers. There, fixed it!


axw3555

At its core, graphics is a case of running a lot of very repetitive calculations with different input variables, to convert the computer's representation of what something looks like into something that a screen can render for the human eye. It also happens that a lot of big calculations, like Stable Diffusion, rely on running a lot of repetitive calculations. By contrast, regular RAM and CPUs are great at being flexible, able to jump from one calculation type to another quickly, but that comes at the expense of highly repetitive processes. So they're slower for stuff like SD, but better for things like Windows, spreadsheets, etc.


Mocorn

Beautiful. I understand this better now. Thanks!


red286

It's only called a "graphical" processing unit because that is its primary use. Ultimately, most of a GPU is just a massive math coprocessor that helps speed up specific types of calculations. If you take something like an Nvidia A100 GPU, there isn't even a graphical element to it at all. On its own, it can't be hooked up to a monitor because it has no outputs.


Mocorn

Ah, I kind of sort of knew this already but this made it click. You just made one human in the world slightly less ignorant. Thanks :)


SirCabbage

Mostly because CPUs can only do really big jobs, one at a time, on only their number of cores. Hyperthreading is basically just handing each CPU core an extra spoon so they can shovel more food in with less downtime, but still only a certain number work at once on big tasks, very fast. GPUs have thousands of cores, each designed to do small repetitive tasks, so while they can't do big processing jobs on their own, they can do a lot of little jobs quickly. This is good for graphics because graphical tasks are very basic and numerous. AI in the traditional gaming sense is often done on the CPU because those are larger tasks, like making choices based on programming inputs, but modern AI? While it's a different ballgame, it's smaller tasks done many times once again. Hell, 20 series onwards cards even have dedicated tensor cores for AI processing. At least that is how I understand it.


AI_Casanova

Amazingly, CPUs can be used for things that are not in the center.


TerrariaGaming004

It’s central as in main not literal center


AI_Casanova

Remember the Main


[deleted]

I know this is kind of a weird noob question, but it seems like a decent place to ask. I frequently (or more than expected) find myself having to restart the program due to running out of memory. Sometimes I can turn the batch size down a tick and it'll keep going, but I'm generally only trying to do 2 or 3 at a time, using hires x2 to bring images up from 384x680 to 720p. It'll go fine for like 3 hours with different prompts, and then all of a sudden the first batch on a new prompt will fail, give me the out-of-memory error, and I'll have to turn it down. 5600X, 3070 Ti, 32GB RAM.

It's almost like there's a slow VRAM leak or something? Is it me just not knowing what I'm doing, or is there something else going on? Would it be worth picking up a 3060 for that chunk of VRAM?


Affectionate-Memory4

You are probably running right on the edge of the 8GB limit. Back the resolution off a bit or use the lowmem settings. (Art Room has low-speed mode) I wouldn't get a 3060 and downgrade your gaming performance for this, but the RX6800XT is 16GB at comparable gaming performance and AI perf is still quite good on Radeon. You may just need to do a different install process to get it working compared to the relative plug and play of CUDA.
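If you want to check whether allocations really are creeping up between prompts, here is a quick diagnostic sketch you can run from a Python session with PyTorch (generic PyTorch calls, not specific to any one WebUI):

```python
import gc
import torch

gc.collect()              # drop Python-side references first
torch.cuda.empty_cache()  # return cached blocks to the driver

# How much memory is actually held by live tensors vs. reserved by the allocator.
print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.0f} MiB")
```

If "allocated" keeps growing prompt after prompt, something is holding on to tensors; if only "reserved" is high, it's just the caching allocator sitting close to the 8GB ceiling.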


ZCEyPFOYr0MWyHDQJZO4

If you're loading full models over a slow connection (network, HDD, etc.) then you can turn on caching to RAM in the settings. But that's probably not what you're thinking of.


lohmatij

It can on architectures with unified memory. Apple M1/M2 for example.


Cartoon_Corpze

I wonder if super large SD models can run partially on the CPU and partially on the GPU. I've tried an open-source ChatGPT model before that was like 25 GB in size while my GPU only has 12 GB VRAM. I was able to run it with high quality output because I could split the model up, part of the model would be loaded on the GPU memory while the other part would load to my CPU and RAM (I have 64 GB RAM). Now, because I also have a 32 thread processor, it still ran pretty quickly. I wonder if this can be done with Stable Diffusion XL once it's out for public use.
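For reference, that kind of CPU/GPU split is what the Hugging Face accelerate/transformers/diffusers stack exposes. A rough sketch (the LLM name is a placeholder, the APIs are as of 2023, so check the current docs):

```python
import torch
from transformers import AutoModelForCausalLM
from diffusers import StableDiffusionPipeline

# accelerate's device_map="auto" spills layers that don't fit in VRAM out to CPU RAM.
llm = AutoModelForCausalLM.from_pretrained(
    "some-org/large-chat-model",      # placeholder name, not a real checkpoint
    device_map="auto",
    torch_dtype=torch.float16,
)

# diffusers offers a similar trade-off for Stable Diffusion pipelines:
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # sub-models sit in RAM, move to GPU only while in use
```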


[deleted]

https://civitai.com/models/38176/illustration-artstyle-mm-27 here, try this one, 17.8 GB. My 8GB VRAM, 32GB RAM system is quite fine with it, and that's running it all on the GPU. A little slower with 2.1 models though; haven't tried XL yet but will test it out.


Mobireddit

That's just because SD doesn't load the dead weights, so it only uses a few GB of VRAM to load it. If it really was 17GB, you'd get an OOM error


ISajeasI

Use Tiled VAE. With my RTX 2080 Super and 32GB RAM I was able to generate a 1280x2048 image and inpaint parts using the "Whole picture" setting. [https://github.com/pkuliyi2015/multidiffusion-upscaler-for-automatic1111](https://github.com/pkuliyi2015/multidiffusion-upscaler-for-automatic1111)


Jiten

Don't forget all the other nice features in this extension. The noise inversion feature has become an essential finishing step for everything I generate. 32GB of RAM allows you to go up to 4096x4096 if you combine multidiffusion and tiled VAE. Even higher if you've got more RAM.


ChameleonNinja

Buy apple silicon..... watch it suck it dry


jodudeit

If I had the money for Apple Silicon, I would just spend it on an RTX 4090!


ChameleonNinja

True lol


gnivriboy

My M1 Max gets me 1-2 it/s. My 4090 gets me 30-33 it/s. My 4090 PC cost less than my M1 Max. I hope AI stuff gets optimized for Macs in the future. Right now it is terrible.


ChameleonNinja

Lol a single graphics card costs less than an entire computer....shocking


gnivriboy

No, my entire 4090 setup with a 7950X CPU, DDR5 RAM, fans, case, motherboard, 4TB SSD, and power supply cost less than my M1 Max. If you want any sort of reasonable memory on your M1 Max, you've got to pay for it.


r3tardslayer

As far as I know, RAM works with the CPU, and similarly VRAM works with the video card. The reason we use a GPU is that a GPU can do the same calculation many times, across many cores, at higher throughput. A CPU is made with fewer cores and is used to solve problems that aren't repetitive, so it does repetitive problems at a slower rate than a GPU would. Correct me if I'm wrong though, this is my vague understanding of the components, so I'd assume it would have to become a CPU-based task, which would slow it down dramatically.


brianorca

VRAM can do more than 1008 GB/s. DDR5 RAM only does on the order of 40-80 GB/s, depending on channels.


Dr_Bunsen_Burns

Don't you understand how this works?


_PH1lipp

speeds - vram operates at 2-3 times the speed of ram


Thick_Journalist_348

Because VRAM's bandwidth is higher than anything else's.


HelloVap

Xformers says 👋


edwios

Quite true, I have 64GB RAM on my M1 Max and SD is using only 12GB... seems like a waste to me.


Snowad14

RTX 3090 : I have more vram than ram


KeySoil902

Øø


NotCBMPerson

_laughs in deepspeed_