How do Graphics Cards Work? Exploring GPU Architecture (https://youtu.be/h9Z4oGN89MU?si=EPPO0kny-gN0zLeC)
Like, the "cowboy hat" example is wrong on multiple levels - GPUs are not SIMD machines, model-to-world translation doesn't work like that in practice - but you can maybe excuse it as a useful educational lie, and he does kind-of explain SIMT later, so is objecting to it valid? Or: the video claims to be about "GPUs", but is in fact exclusively about Nvidia's architecture and the GA102 in particular; is this a valid complaint, or is the lie excusable because it's just a YouTube video? Or: it overemphasizes the memory chips because of who's sponsoring it; does this compromise the message? Or: it plays fast-and-loose with die shots and floorplans; is a viewer expected to understand that it's impossible to tell where the FMA units really are? Or: it spends a lot of time on relatively unimportant topics while neglecting things like instruction dispatch, registers, dedicated graphics hardware, etc.; but is it really fair to complain, considering the target audience doesn't seem to be programmers? And so on.
Did you actually get anything out of this video? Any new knowledge? The article seems like a much more useful intro, even if it is also specific to Nvidia and CUDA.
Branch Education is designed to introduce complex concepts, often for high schoolers or newcomers to the subject. Even my first grader finds it interesting because it’s visually engaging and conveys a general understanding, even if much of the terminology goes over their head. Their video on how computer chips are made, for example, managed to hold the whole family’s attention. That is hard to do for most of the nerdy shit I watch on YouTube!
It’s not meant to be a deep dive—Ben Eater is better suited for that. His work on instruction counters, registers, and the intricacies of “how CPUs work” is incredible, but it’s for a different audience. You need a fair amount of computer science and electrical engineering knowledge to get the most out of his content. Good luck getting my family to watch him breadboard an entire graphics system; it’s fascinating but requires a serious commitment to appreciate fully.
BUT... both the video and the article are useful before you do that. They both allow you to build a mental model of how GPUs work that you can test later on.
Some discussion then: https://news.ycombinator.com/item?id=37967126
You don't call functions to tell the GPU what to do - you record commands into a buffer that the GPU executes later, at some indeterminate point. When you call dispatch/draw(), nothing actually happens yet.
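To make the record-then-submit model concrete, here is a minimal sketch in Metal terms (other APIs such as Vulkan and D3D12 work similarly); the buffer sizes are arbitrary placeholders and error handling is omitted:

```swift
import Metal

// Minimal sketch of the deferred command model: work is *recorded* into a
// command buffer and only runs on the GPU after commit().
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let src = device.makeBuffer(length: 1024, options: .storageModeShared)!
let dst = device.makeBuffer(length: 1024, options: .storageModeShared)!

let commandBuffer = queue.makeCommandBuffer()!
let blit = commandBuffer.makeBlitCommandEncoder()!
blit.copy(from: src, sourceOffset: 0, to: dst, destinationOffset: 0, size: 1024)
blit.endEncoding()
// At this point nothing has happened on the GPU; the copy is only recorded.

commandBuffer.commit()              // hand the recorded work to the GPU
commandBuffer.waitUntilCompleted()  // it executes at some later point; here we just block on it
```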
Another kind of misconception: data transfer is a _really_ overlooked issue. People think "oh, this is a parallel problem, I can have the GPU do it" and completely discount the cost of sending the data to the GPU and then getting it back. If you want to write 20 MB of data to a buffer, that's not just a memcpy: all that data has to go over the PCIe bus to the GPU (which, again, is a completely separate device unless you're using an iGPU), and that's going to be expensive (in real-time contexts). Similarly, if you want to read a whole large buffer of results back from the GPU, that's going to take some time.
For data transfers, my experience is that you hit that bottleneck rather quickly, and it's a tough one. And it's not proportional to the number of transferred bits: even transferring a single byte naively can be extremely costly (like half the performance, I'm not joking).
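As a concrete illustration of the 20 MB example above, here is a sketch of the usual staging-buffer upload pattern, again in Metal terms (sizes and names are illustrative assumptions); on a discrete GPU the blit into .storageModePrivate memory is the part that pays the bus cost:

```swift
import Metal

// Sketch of why writing ~20 MB to a GPU buffer is not just a memcpy on a
// discrete card: the bytes land in a CPU-visible staging buffer first, then a
// blit copies them into GPU-local (.storageModePrivate) memory over the bus.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let payload = [UInt8](repeating: 0, count: 20 * 1024 * 1024)  // ~20 MB of data

// CPU-visible staging buffer: filling this is roughly a memcpy.
let staging = device.makeBuffer(bytes: payload, length: payload.count,
                                options: .storageModeShared)!
// GPU-local destination: the CPU cannot touch this directly.
let gpuLocal = device.makeBuffer(length: payload.count,
                                 options: .storageModePrivate)!

let cb = queue.makeCommandBuffer()!
let blit = cb.makeBlitCommandEncoder()!
blit.copy(from: staging, sourceOffset: 0, to: gpuLocal,
          destinationOffset: 0, size: payload.count)          // crosses the bus
blit.endEncoding()
cb.commit()   // the transfer happens asynchronously, and it is not free
```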
Does having a unified memory, like Apple’s M-series chips, help with that?
When you are running your code on an M- or A-series processor, most of that stuff probably ends up as no-ops. The worst case is that you copy from RAM to RAM, which is still vastly faster than pushing anything across the PCIe bus.
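A sketch of what that looks like in Metal on Apple silicon, assuming a plain .storageModeShared buffer (sizes are illustrative): the CPU writes directly into the memory the GPU will read, with no transfer step at all.

```swift
import Metal

// Unified memory sketch: a .storageModeShared buffer lives in the same
// physical RAM for both CPU and GPU, so "uploading" it is just a write.
let device = MTLCreateSystemDefaultDevice()!
let shared = device.makeBuffer(length: 4096, options: .storageModeShared)!

// CPU fills the buffer in place; no staging copy, no bus transfer.
let floats = shared.contents().bindMemory(to: Float.self, capacity: 1024)
for i in 0..<1024 { floats[i] = Float(i) }
// A kernel bound to `shared` sees these values as-is; on a discrete GPU the
// same data would first have had to cross PCIe into VRAM.
```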
iGPUs have less latency, but also much less bandwidth. So all those global memory fetches in modern GPU algorithms become much slower when you look at overall throughput across the dispatch from a bird's-eye view. It's why things like SSAO are way more expensive on iGPUs, despite running at a lower resolution.
I suppose next you’ll also say the GPU memory is designed to be accessed in a much more “parallel fat pipe” way that can shove gobs of data across the bus all at once vs the CPU which doesn’t have that requirement?
I mean the whole idea is “single instruction multiple data” and GPU takes that to the extreme… so yeah I guess the data pipeline has to be ultra-wide to shove shit across all the little cores as quickly as possible.
I feel these sorts of "operational" questions are often neglected in these discussions, but considering how GPUs are increasingly being used in a wide range of applications (both graphics and compute), I think it's becoming relevant to think about how they play together.
But within your own application, yes, you can create multiple streams and assign each a priority.
If you are in Apple land, you can set the priority of your Metal command queues. Even with the integrated GPU, it is logically a separate device and you have limited control over it. And the point about the GPU having its own compiler allows for some weird behavior. For example, it is possible to use an iPad to do Metal GPU programming in the Swift Playgrounds app: use the standard boilerplate Swift code that sets an app up for executing a Metal shader, but instead of telling it to run a compiled kernel from the kernel library that would have been created during the application compile step on a desktop PC, pass it Metal C++ source code as a plain old Swift string. The Metal runtime will compile and run it! You get no warnings about errors in your code; it either works or it doesn't, but it shows that the compiler is indeed somewhere else :)
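A sketch of that trick, with a trivial hypothetical kernel named doubler (error handling reduced to try! for brevity):

```swift
import Metal

// Runtime compilation: hand the Metal source to the driver as a Swift string
// instead of loading a precompiled kernel from the app's .metallib.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;
kernel void doubler(device float *data [[buffer(0)]],
                    uint id [[thread_position_in_grid]]) {
    data[id] *= 2.0f;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)  // compiled on the fly
let function = library.makeFunction(name: "doubler")!
let pipeline = try! device.makeComputePipelineState(function: function)
// From here it's the usual command-buffer dance: encode a dispatch and commit.
```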
You're right that this is an actual consideration for programs. I don't know about CUDA/GPGPU stuff, but for games you need to manage the residency of your resources, essentially telling the driver which resources (buffers/textures) are critical and can't be unloaded to make space for other apps, and which aren't critical.
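Staying with Metal for consistency with the rest of the thread: purgeability hints are one loosely analogous knob (not a full residency API); the sketch below assumes two illustrative buffers and simply marks which contents may be discarded under memory pressure.

```swift
import Metal

// Purgeability hints: tell the system which resource contents must survive
// memory pressure and which may be discarded to make room for other work.
let device = MTLCreateSystemDefaultDevice()!

let critical = device.makeBuffer(length: 1 << 20, options: .storageModePrivate)!
let scratch  = device.makeBuffer(length: 1 << 20, options: .storageModePrivate)!

_ = critical.setPurgeableState(.nonVolatile)  // contents must be kept
_ = scratch.setPurgeableState(.volatile)      // contents may be thrown away
```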
It took a while before SSDs stopped using SATA, despite SATA being a huge bottleneck; they've since moved to NVMe over PCIe. Surely there is work to do something like that for the GPU.
Because while it's been a long while since I've built a computer, I do know that the video card has always been the peripheral that pushed the limits of whatever interconnect bus existed at the time. There were all kinds of hacks for it, like AGP, the VESA Local Bus, and then, to a large degree, even PCI Express.
> Try this on: "A non-trivial number of Computer Scientists, Computer Engineers, Electrical Engineers, and hobbyists have ..."
> Took some philosophy courses for fun in college. I developed a reading skill there that lets me forgive certain statements by improving them instead of dismissing them. My brain now automatically translates over-generalizations and even outright falsehoods into rationally-nearby true statements. As the argument unfolds, those ideas are reconfigured until the entire piece can be evaluated as logically coherent.
> The upshot is that any time I read a crappy article, I'm left with a new batch of true and false premises or claims about topics I'm interested in. And thus my mental world expands.
https://umbrex.com/resources/tools-for-thinking/what-is-stee...
If you see people doing this, befriend them because it means they're valuing knowledge higher than their ego.
I suppose it always depends on the goals of the conversation and the participants' greater view of the world. Your advice also seems similar to the sage wisdom along the lines of "It's the mark of an educated mind to entertain a thought without accepting it" but goes one step further.
I generally agree with you; it truly is rare to find people who can put aside their ego in the pursuit of a higher (or common) goal.
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.