How do Graphics Cards Work? Exploring GPU Architecture (https://youtu.be/h9Z4oGN89MU?si=EPPO0kny-gN0zLeC)
Like, the "cowboy hat" example is wrong on multiple levels - GPUs are not SIMD machines, model-to-world translation doesn't work like that in practice - but you can maybe excuse it as a useful educational lie, and he does kind-of explain SIMT later, so is objecting to it valid? Or: the video claims to be about "GPUs", but is in fact exclusively about Nvidia's architecture and the GA102 in particular; is this a valid complaint, or is the lie excusable because it's just a YouTube video? Or: it overemphasizes the memory chips because of who's sponsoring it; does this compromise the message? Or: it plays fast-and-loose with die shots and floorplans; is a viewer expected to understand that it's impossible to tell where the FMA units really are? Or: it spends a lot of time on relatively unimportant topics while neglecting things like instruction dispatch, registers, dedicated graphics hardware, etc.; but is it really fair to complain, considering the target audience doesn't seem to be programmers? And so on.
Did you actually get anything out of this video? Any new knowledge? The article seems like a much more useful intro, even if it is also specific to Nvidia and CUDA.
Branch Education is designed to introduce complex concepts, often for high schoolers or newcomers to the subject. Even my first grader finds it interesting because it’s visually engaging and conveys a general understanding, even if much of the terminology goes over their head. Their video on how computer chips are made, for example, managed to hold the whole family’s attention. That is hard to do for most of the nerdy shit I watch on YouTube!
It’s not meant to be a deep dive—Ben Eater is better suited for that. His work on instruction counters, registers, and the intricacies of “how CPUs work” is incredible, but it’s for a different audience. You need a fair amount of computer science and electrical engineering knowledge to get the most out of his content. Good luck getting my family to watch him breadboard an entire graphics system; it’s fascinating but requires a serious commitment to appreciate fully.
BUT... both the video and the article are useful before you do that. They both allow you to build a mental model of how GPUs work that you can test later on.
Some discussion then: https://news.ycombinator.com/item?id=37967126
You don't call functions to tell the GPU what to do - you record commands into a buffer that the GPU executes later, at some indeterminate point. When you call dispatch/draw(), nothing actually happens yet.
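To make the record-then-submit model concrete, here is a minimal sketch in Metal terms (other APIs such as Vulkan and D3D12 work similarly); the buffer sizes are arbitrary placeholders and error handling is omitted:

```swift
import Metal

// Minimal sketch of the deferred command model: work is *recorded* into a
// command buffer and only runs on the GPU after commit().
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let src = device.makeBuffer(length: 1024, options: .storageModeShared)!
let dst = device.makeBuffer(length: 1024, options: .storageModeShared)!

let commandBuffer = queue.makeCommandBuffer()!
let blit = commandBuffer.makeBlitCommandEncoder()!
blit.copy(from: src, sourceOffset: 0, to: dst, destinationOffset: 0, size: 1024)
blit.endEncoding()
// At this point nothing has happened on the GPU; the copy is only recorded.

commandBuffer.commit()              // hand the recorded work to the GPU
commandBuffer.waitUntilCompleted()  // it executes at some later point; here we just block on it
```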
Another kind of misconception: data transfer is a _really_ overlooked issue. People think "oh, this is a parallel problem, I can have the GPU do it" and completely discount the cost of sending the data to the GPU and then getting it back. If you want to write 20 MB of data to a buffer, that's not just a memcpy: all that data has to go over the PCIe bus to the GPU (which, again, is a completely separate device unless you're using an iGPU), and that's going to be expensive (in real-time contexts). Similarly, if you want to read a whole large buffer of results back from the GPU, that's going to take some time.
For data transfers, my experience is that you hit that bottleneck rather quickly, and it's a tough one. And it's not proportional to the number of transferred bits: even transferring a single byte naively can be extremely costly (like half the performance, I'm not joking).
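As a concrete illustration of the 20 MB example above, here is a sketch of the usual staging-buffer upload pattern, again in Metal terms (sizes and names are illustrative assumptions); on a discrete GPU the blit into .storageModePrivate memory is the part that pays the bus cost:

```swift
import Metal

// Sketch of why writing ~20 MB to a GPU buffer is not just a memcpy on a
// discrete card: the bytes land in a CPU-visible staging buffer first, then a
// blit copies them into GPU-local (.storageModePrivate) memory over the bus.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let payload = [UInt8](repeating: 0, count: 20 * 1024 * 1024)  // ~20 MB of data

// CPU-visible staging buffer: filling this is roughly a memcpy.
let staging = device.makeBuffer(bytes: payload, length: payload.count,
                                options: .storageModeShared)!
// GPU-local destination: the CPU cannot touch this directly.
let gpuLocal = device.makeBuffer(length: payload.count,
                                 options: .storageModePrivate)!

let cb = queue.makeCommandBuffer()!
let blit = cb.makeBlitCommandEncoder()!
blit.copy(from: staging, sourceOffset: 0, to: gpuLocal,
          destinationOffset: 0, size: payload.count)          // crosses the bus
blit.endEncoding()
cb.commit()   // the transfer happens asynchronously, and it is not free
```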
Does having a unified memory, like Apple’s M-series chips, help with that?
When you are running your code on an M- or A-series processor, most of that stuff probably ends up as no-ops. The worst case is that you copy from RAM to RAM, which is still vastly faster than pushing anything across the PCIe bus.
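A sketch of what that looks like in Metal on Apple silicon, assuming a plain .storageModeShared buffer (sizes are illustrative): the CPU writes directly into the memory the GPU will read, with no transfer step at all.

```swift
import Metal

// Unified memory sketch: a .storageModeShared buffer lives in the same
// physical RAM for both CPU and GPU, so "uploading" it is just a write.
let device = MTLCreateSystemDefaultDevice()!
let shared = device.makeBuffer(length: 4096, options: .storageModeShared)!

// CPU fills the buffer in place; no staging copy, no bus transfer.
let floats = shared.contents().bindMemory(to: Float.self, capacity: 1024)
for i in 0..<1024 { floats[i] = Float(i) }
// A kernel bound to `shared` sees these values as-is; on a discrete GPU the
// same data would first have had to cross PCIe into VRAM.
```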
iGPUs have less latency, but also much less bandwidth. So all those global memory fetches in modern GPU algorithms become much slower when you look at overall throughput across the dispatch from a bird's-eye view. It's why things like SSAO are way more expensive on iGPUs, despite running at a lower resolution.
I suppose next you’ll also say the GPU memory is designed to be accessed in a much more “parallel fat pipe” way that can shove gobs of data across the bus all at once vs the CPU which doesn’t have that requirement?
I mean the whole idea is “single instruction multiple data” and GPU takes that to the extreme… so yeah I guess the data pipeline has to be ultra-wide to shove shit across all the little cores as quickly as possible.
I feel these sorts of "operational" questions are often neglected in these discussions, but considering how GPUs are increasingly being used in a wide range of applications (both graphics and compute), I think it's becoming relevant to think about how they play together.
But within your own application, yes, you can create multiple streams and assign each a priority.
If you are in Apple land, you can set the priority of your Metal command queues. Even with the integrated GPU, it is logically a separate device and you have limited control over it. And the point about the GPU having its own compiler allows for some weird behavior. For example, it is possible to use an iPad to do Metal GPU programming in the Swift Playgrounds app: use the standard boilerplate Swift code that sets an app up for executing a Metal shader, but instead of telling it to run a compiled kernel from the kernel library that would have been created during the application compile step on a desktop PC, pass it Metal C++ source code as a plain old Swift string. The Metal runtime will compile and run it! You get no warnings about errors in your code; it either works or it doesn't, but it shows that the compiler is indeed somewhere else :)
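A sketch of that trick, with a trivial hypothetical kernel named doubler (error handling reduced to try! for brevity):

```swift
import Metal

// Runtime compilation: hand the Metal source to the driver as a Swift string
// instead of loading a precompiled kernel from the app's .metallib.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;
kernel void doubler(device float *data [[buffer(0)]],
                    uint id [[thread_position_in_grid]]) {
    data[id] *= 2.0f;
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)  // compiled on the fly
let function = library.makeFunction(name: "doubler")!
let pipeline = try! device.makeComputePipelineState(function: function)
// From here it's the usual command-buffer dance: encode a dispatch and commit.
```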
You're right that this is an actual consideration for programs. I don't know about CUDA/GPGPU stuff, but for games you need to manage the residency of your resources, essentially telling the driver which resources (buffers/textures) are critical and can't be unloaded to make space for other apps, and which aren't critical.
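Staying with Metal for consistency with the rest of the thread: purgeability hints are one loosely analogous knob (not a full residency API); the sketch below assumes two illustrative buffers and simply marks which contents may be discarded under memory pressure.

```swift
import Metal

// Purgeability hints: tell the system which resource contents must survive
// memory pressure and which may be discarded to make room for other work.
let device = MTLCreateSystemDefaultDevice()!

let critical = device.makeBuffer(length: 1 << 20, options: .storageModePrivate)!
let scratch  = device.makeBuffer(length: 1 << 20, options: .storageModePrivate)!

_ = critical.setPurgeableState(.nonVolatile)  // contents must be kept
_ = scratch.setPurgeableState(.volatile)      // contents may be thrown away
```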
It took a while before SSDs stopped using SATA, despite SATA being a huge bottleneck; they've since moved to NVMe over PCIe. Surely there is work to do something like that for the GPU.
Because while it's been a long while since I've built a computer, I do know that the video card has always been the peripheral that pushed the limits of whatever interconnect bus existed at the time. There were all kinds of hacks for it, like AGP, the VESA Local Bus, and then, to a large degree, even PCI Express.
> Try this on: "A non-trivial number of Computer Scientists, Computer Engineers, Electrical Engineers, and hobbyists have ..."
> Took some philosophy courses for fun in college. I developed a reading skill there that lets me forgive certain statements by improving them instead of dismissing them. My brain now automatically translates over-generalizations and even outright falsehoods into rationally-nearby true statements. As the argument unfolds, those ideas are reconfigured until the entire piece can be evaluated as logically coherent.
> The upshot is that any time I read a crappy article, I'm left with a new batch of true and false premises or claims about topics I'm interested in. And thus my mental world expands.
https://umbrex.com/resources/tools-for-thinking/what-is-stee...
If you see people doing this, befriend them because it means they're valuing knowledge higher than their ego.
I suppose it always depends on the goals of the conversation and the participants' greater view of the world. Your advice also seems similar to the sage wisdom along the lines of "It's the mark of an educated mind to entertain a thought without accepting it" but goes one step further.
I generally agree with you; it truly is rare to find people who can put aside their ego in the pursuit of a higher (or common) goal.
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.