• eulgro 14 hours ago |
    What other kinds of optimizations can be made by using the page table?

    For example, from what I understand, allocators operate on virtual memory, so if for example a vector increases in size past its capacity, memory has to be reallocated and the existing content copied. Wouldn't it be possible to perform that copy by updating virtual addresses in the page table, with no copy actually being performed in physical memory?

    • ncruces 13 hours ago |
      If the array is large enough to span a page, and the memory was obtained from the system with mmap, you can extend it with mremap, yes.

      https://man7.org/linux/man-pages/man2/mremap.2.html
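
      Rough sketch (Linux-only, untested here) of letting mremap() grow the buffer, moving it by page-table update rather than by copying bytes if it can't be extended in place:

          /* Hypothetical example: "reallocate" a large mmap'd buffer with mremap. */
          #define _GNU_SOURCE
          #include <sys/mman.h>
          #include <stdio.h>

          int main(void) {
              size_t old_size = 1 << 20, new_size = 1 << 21;
              void *buf = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (buf == MAP_FAILED) return 1;

              /* MREMAP_MAYMOVE lets the kernel relocate the mapping by rewriting
                 page table entries; the page contents are not copied. */
              void *grown = mremap(buf, old_size, new_size, MREMAP_MAYMOVE);
              if (grown == MAP_FAILED) return 1;

              printf("grew %p -> %p\n", buf, grown);
              munmap(grown, new_size);
              return 0;
          }

      I believe glibc's realloc() already does something like this internally for large blocks it obtained via mmap.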

    • superjan 13 hours ago |
      There’s a great paper from the group of Emery Berger, where they merge memory pages that become sparse. A kind of garbage collection for C and C++ languages. There’s also a good presentation about it on youtube.

      https://arxiv.org/abs/1902.04738

      The presentation: https://youtu.be/c1UBJbfR-H0

      • eulgro 12 hours ago |
        Interesting! Thanks for the link.
    • giomasce 13 hours ago |
      That still requires you to plan how you use the virtual address space, though. You can't just tack more pages onto the end of your vector if you've already started using those virtual addresses for other stuff. So you have to know in advance how much space your vector might end up using, and carefully avoid placing anything else there. If you're willing to come up with such an estimate, and if your OS is happy to overcommit memory (and I think Linux is, by default at least), then you can just malloc() all that memory in the first place. With overcommit, no physical page will be used to back your virtual memory block until you actually touch it.

      If your system doesn't do overcommitting, then I guess that with some mmap() trickery (using the appropriate flags) you could do more or less the same thing: reserve the virtual address space you need up front, and then actually back it with memory as you go.
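
      For the record, a rough sketch of that reserve-then-commit idea (untested, Linux-flavoured flags; RESERVE_BYTES and the helper names are just for illustration):

          #include <sys/mman.h>
          #include <stddef.h>

          #define RESERVE_BYTES (1ULL << 32)   /* reserve 4 GiB of address space */

          /* Reserve address space only: PROT_NONE means no page is readable or
             writable yet, and MAP_NORESERVE avoids charging it against overcommit. */
          static void *reserve_region(void) {
              return mmap(NULL, RESERVE_BYTES, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
          }

          /* Later, commit the first `used` bytes so loads/stores stop faulting. */
          static int commit_prefix(void *base, size_t used) {
              return mprotect(base, used, PROT_READ | PROT_WRITE);
          }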

    • junon 13 hours ago |
      Technically yes, you can get copy-on-write semantics: mark both the original and the "copied" page as read-only, and the first write to either virtual page causes a page fault (access violation) into the kernel, which allocates a new physical page, copies the contents, points the faulting entry at the new page with RW permissions (the other entry can be made RW again once it's the only user), then does the equivalent of `invlpg` to flush the stale entry out of the TLB, before returning to the program to retry the write instruction.
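
      (Not the in-kernel path described above, but you can see the same effect from user space on Linux by mapping one memfd twice with MAP_PRIVATE - untested sketch, error handling omitted:)

          #define _GNU_SOURCE
          #include <sys/mman.h>
          #include <unistd.h>
          #include <string.h>
          #include <assert.h>

          int main(void) {
              size_t len = 4096;
              int fd = memfd_create("cow-demo", 0);    /* needs glibc >= 2.27 */
              ftruncate(fd, len);
              write(fd, "original", 8);

              /* Two private views of the same pages: physically shared until written. */
              char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
              char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

              memcpy(a, "modified", 8);                /* CoW fault: `a` gets its own copy */
              assert(strncmp(b, "original", 8) == 0);  /* `b` still sees the shared page */

              munmap(a, len); munmap(b, len); close(fd);
              return 0;
          }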

      However, this gets tricky with heap allocators, since they often pack several `malloc`-style allocations into the same page using some allocation scheme (size classes, buddy allocation, etc.). You can usually bypass the in-process allocator and manage page mappings yourself, it's just a little more cumbersome.

      This is actually how lazy or "virtual" allocation (I think that's the term, don't quote me) works: if you `malloc` a very large block on many modern OSes, you get a pointer back and the range is registered in the process's virtual address space, but no page table entries are actually populated (the kernel leaves them "not present", so any read or write in that range causes a page fault). When a page in that range is touched for the first time, the page fault handler sees that it belongs to that very large reservation, allocates a physical page, maps it to the faulting virtual page, and returns.

      It allows for e.g. allocating a VERY large sparse array of items (even terabytes!) that might otherwise exceed the physical memory actually available to the system, on the assumption that it will never be fully populated (or at least never touch enough pages to exhaust "real", physical memory).
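
      For instance, something like this should work on a 64-bit Linux box (untested sketch):

          #include <sys/mman.h>
          #include <stdint.h>
          #include <stdio.h>

          int main(void) {
              size_t len = 1ULL << 40;    /* "allocate" 1 TiB of virtual address space */
              uint8_t *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
              if (p == MAP_FAILED) { perror("mmap"); return 1; }

              p[0] = 1;                   /* each first touch faults in a single page */
              p[len / 2] = 2;
              p[len - 1] = 3;

              puts("1 TiB mapped, only ~3 pages resident");
              munmap(p, len);
              return 0;
          }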

      This is also how file mapping works. You memory-map a file from the filesystem, and the kernel keeps a cache of pages read from that file; accessing a page that hasn't been read in (and cached) causes a page fault, prompting a read from disk into a physical page, an update of the page tables, and resumption of the faulting thread. It also lets the kernel reclaim less-frequently-used file-mapped pages for higher-priority allocations at pretty much any time, since the next time the program touches that page the kernel just faults again, reads it into a new page, and updates the tables. It's entirely transparent to the process.
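
      Minimal file-mapping sketch (POSIX; error handling mostly omitted, /etc/hostname is just a stand-in for any readable file):

          #include <sys/mman.h>
          #include <sys/stat.h>
          #include <fcntl.h>
          #include <unistd.h>
          #include <stdio.h>

          int main(void) {
              int fd = open("/etc/hostname", O_RDONLY);
              struct stat st;
              if (fd < 0 || fstat(fd, &st) < 0) return 1;

              /* No read() happens here; pages come in lazily via page faults. */
              char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
              if (data == MAP_FAILED) return 1;

              fwrite(data, 1, st.st_size, stdout);   /* first access faults the data in */
              munmap(data, st.st_size);
              close(fd);
              return 0;
          }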

      I'm designing a novel kernel at the moment and some of these sorts of possibilities are what I find the more interesting parts of design, actually. I find them to be underutilized to some extent even in modern kernels :)

    • HALtheWise 7 hours ago |
      One example that comes to mind is the (abandoned) proposal for safe arena allocation in Go. It takes advantage of the fact that 64 bit computers have absolutely massive virtual memory, so that "freed" memory can be permanently unmapped from virtual memory in order to make any future accesses an error instead of a use-after-free bug.

      https://github.com/golang/go/issues/51317

  • virexene 13 hours ago |
    The author claims that mmap is exposed by glibc whereas memfd_create is not, so they reimplemented it with syscall, but looking at the man page, did they just forget to #define _GNU_SOURCE?
    • lq0000 9 hours ago |
      When I originally wrote this, it indeed was not
  • defrost 13 hours ago |
    > Apparently this trick has been known for a while, but is seldom used.

    A long, long, long time - especially on old big iron mainframes.

    Here's a virtual ring buffer in C using mmap from 2002 written by a grey beard former mainframe coder, Philip Howard aka skapare:

    https://web.archive.org/web/20100706051254/http://libh.slash...

  • giomasce 13 hours ago |
    How does that interact with the cache? Does accessing the ring buffer through the second set of mapped pages end up hitting the same cache line, or is it a fresh request to main memory? If it's the latter, I guess that has a good chance of making your circular buffer slower, depending on how big it is, how your cache works, and how much cache pressure you experience. I don't think I know enough about actual caches to say whether that's likely or not.
    • junon 13 hours ago |
      Same cache-line. CPU caches come after virtual memory translations / TLB lookups. Memory caches work on physical addresses, not linear (virtual) addresses.

      Memory access -> TLB cache lookup -> PT lookup (if TLB miss) -> L1 cache check (depending on PT flags) -> L2 cache check (depending on PT flags, if L1 misses) -> ... -> main memory fetch, to boil it down simply.

      CPUs would be ridiculously slow if that wasn't the case. Also upon thinking about it a bit more, I have no idea how it'd even work if it was the other way around. (EDIT: To be clear, I meant if main memory cache was hit first followed by the MMU - someone correctly mentioned VIVT caches which aren't what I meant :D)

      • saagarjha 13 hours ago |
        VIVT caches exist, though.
        • junon 12 hours ago |
          That's very true, though AFAIK they aren't used much in modern processors. It's usually PIPT or VIPT (I think I've seen more references to the latter), VIPT being prevalent because the cache index lookup and the address translation can be done in parallel in the circuitry.

          But I've not designed CPUs nor do I work for $chip_manu so I'm speculating. Would love more info if anyone has it.

          EDIT: Looks like some of the x86 manus have figured out a VIPT design that has fewer downsides and behaves more like a PIPT cache. I'd imagine the "how" is more of a trade secret, though. I wonder what the ARM manus do, actually. Going to have to look it up :D

          • tliltocatl 11 hours ago |
            Original ARM (as in Acorn RISC Machine) did VIVT. Interestingly, to allow the OS to access physical memory without aliasing, ARM1 only translated part of its 26-bit address space; the rest of it was always physical.

            Nowadays you don't see it, exactly because of those aliasing problems. Hardware people would love to have VIVT back, because having to take the cache index from the untranslated page-offset bits is what limits L1 cache size today. I hope nobody actually does it, because it's a thing you can't really abstract away, and it interacts badly with applications that aren't aware of it.

            Somewhat tangentially, this is also true of other virtual memory design choices, like page size (Apple Silicon had problems with software that assumed 4096-byte pages). I seriously wish CPU designers wouldn't be too creative with such hard-to-abstract things. Shaving a few hundred transistors isn't really worth the eternal suffering of everyone who has to provide compatibility for it afterwards. Nowadays this is generally recognised (RISC-V was quite conscious about it); pre-AMD64 systems like Itanium and MIPS were the total wild west about it.

            Another example of a hard-to-abstract thing that is still ubiquitous is incoherent TLBs. It might have been the right choice back when SMP and multithreading were uncommon (a TLB flush on a single core is cheap), but it certainly isn't anymore, with IPIs being super expensive. The problem is that it directly affects how we write applications. Memory reclamation is so expensive that it's not worth it, so nobody bothers. Passing buffers by remapping virtual memory is expensive, so we use memcpy everywhere. Which means it's hard to quantify the real-life benefit of TLB coherence, which makes it even more unlikely we'll ever get it.

            • junon 11 hours ago |
              Thanks for the information!

              > Passing buffers by virtual memory remapping is expensive, so we use memcpy everywhere.

              Curious if you could expand on this a bit; memcpy still requires that two buffers are mapped in anyway. Do you mean that avoiding maps is more important than avoiding copies? Or is there something inherent about multiple linear addresses -> same physical address that is somehow slower on modern processors?

              • tliltocatl 4 minutes ago |
                Assume an (untrusted) application A wants to send a stream of somewhat long messages (several tens of KB, multiple pages each) to application B. A and B could establish a shared memory region for this, but that would possibly allow A to trigger a TOCTOU vulnerability in B by modifying the buffer after B has started reading the message. If reclaiming a page capability were cheap, the OS could unmap the shared buffer from A before notifying B of the incoming message. But nowadays unmapping requires synchronizing with every CPU that might have A's mapping in its TLB, so memcpy is cheaper.
            • fanf2 9 hours ago |
              Original ARM (ARM1 and ARM2) were cacheless; ARM3 was the first with a cache.

              The CPU’s 26 bit address space was split into virtually mapped RAM in the bottom half, and the machine’s physical addresses in the top half. The physical address space had the RAM in the lowest addresses, with other stuff such as ROMs, memory-mapped IO, etc. higher up. The virtual memory hardware was pretty limited: it could not map a page more than once. But you could see both the virtually mapped and physically addressed versions of the same page.

              RISC OS used this by placing the video memory in the lowest physical addresses in RAM, and also mapping it into the highest virtual addresses, so there were two copies of video memory next to each other in the middle of the 26 bit address space. The video hardware accessed memory using the same address space as the CPU, so it could do fast full-screen scrolling by adjusting the start address, using exactly the same trick as in the article.

  • truefossil 13 hours ago |
    Linux and FreeBSD seem to have memfd_create(). macOS seems to have shm_open() instead. Actually, the first two have that one as well. Interesting.

    The technique itself is indeed very elegant.
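
    If I understand the trick right, the shm_open() variant looks roughly like this (untested sketch; ring_create() and the shm name are made up, and size must be page-aligned):

        #include <sys/mman.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <stddef.h>

        static void *ring_create(size_t size) {
            int fd = shm_open("/ringbuf-demo", O_RDWR | O_CREAT | O_EXCL, 0600);
            if (fd < 0) return NULL;
            shm_unlink("/ringbuf-demo");              /* keep it effectively anonymous */
            if (ftruncate(fd, size) < 0) { close(fd); return NULL; }

            /* Reserve 2*size of contiguous address space, then map the same
               object into both halves so wraparound needs no index math. */
            char *base = mmap(NULL, 2 * size, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (base == MAP_FAILED) { close(fd); return NULL; }

            mmap(base,        size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
            mmap(base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
            close(fd);
            return base;
        }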

  • vdupras 12 hours ago |
    > The fact that a modulo operation is necessary to index into the array makes this function hard (if not impossible) to vectorize, and thus unnecessarily slow.

    It seems to me that this benchmark conflates two optimizations that could be tackled separately.

    For example, if you took the naive approach and just used a power-of-2 buffer size so that the modulo could be replaced with "&", you'd get a pretty good speedup without resorting to mmap, no?
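
    Something like this (sketch; CAP is just an example size):

        #include <stddef.h>

        #define CAP (1u << 16)                /* capacity must be a power of two */

        static size_t wrap(size_t i) {
            return i & (CAP - 1);             /* same result as i % CAP, no division */
        }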

    • coldpepper 11 hours ago |
      Exactly. I don't know how the blog post author could have missed this.
      • lq0000 9 hours ago |
        Author here. I was 22 when I wrote this nearly 8 years ago. I just thought it was neat, give me a break. Thanks.
        • superjan 8 hours ago |
          Don’t worry. It _is_ a neat trick, and it was nice to post complete instructions and benchmarks. And it wasn’t you who put it to the HN frontpage.
    • tom_ 10 hours ago |
      But it's the same thing: extra per-byte overhead. You can polish this turd, or (better, in my view) give up entirely and do what the subsequent snippet does: copy the data in 1 or 2 blocks.

      Whatever you do, there's another advantage to the mmap trick: if the items are variable sizes, they never need to be split. (Split items are often a pain to deal with.) The circularity becomes the buffer's problem alone, and consumers don't have to handle it.

      Another article about this here: https://fgiesen.wordpress.com/2012/07/21/the-magic-ring-buff... - see also the comments.
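
      For reference, the "1 or 2 blocks" version is roughly this (sketch, not the article's code; assumes n <= cap and the caller advances pos):

          #include <string.h>
          #include <stddef.h>

          static void ring_write(char *buf, size_t cap, size_t pos,
                                 const char *src, size_t n) {
              size_t first = cap - pos;                 /* room before the physical end */
              if (n <= first) {
                  memcpy(buf + pos, src, n);            /* 1 block: no wrap needed */
              } else {
                  memcpy(buf + pos, src, first);        /* 2 blocks: fill the tail... */
                  memcpy(buf, src + first, n - first);  /* ...then wrap to the front */
              }
          }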

      • vdupras 9 hours ago |
        Certainly, but conflating the two makes the mmap trick appear better than it really is. There's a good chance that the "&" change alone brings a very significant speedup.
        • tom_ 4 hours ago |
          I updated the author's benchmark: https://gcc.godbolt.org/z/bxEzerjc8

          While updating it, I noticed a couple of things:

          1. The Makefile doesn't specify -O, which seems to imply -O0, unreasonably flattering the 1-memcpy case because there's only 1 call to a generic memcpy rather than 2. With -O3 the memcpys get inlined.

          2. I made the message size more annoying: the queue size was a multiple of the message size, so the 2-memcpy case was never exercised - and -O3 could see this.

          (Change 2 actually slightly cancels out change 1, and now the 2-memcpy case is terrible again...)

          Who even knows what sort of system the Compiler Explorer executor runs the programs on, but this is what I got from running it, and the relative speeds might be representative:

              %: 0.033716 microseconds per write
              &: 0.031772 microseconds per write
              slow: 0.009715 microseconds per write
              fast: 0.001035 microseconds per write
    • ahartmetz 14 minutes ago |
      Actually, even "if (pos >= bufSize) { pos -= bufSize; }" is much faster than pos = pos % bufSize. Division (and modulo) are about the only basic math operations that a modern wide ALU can't easily do three or four of per cycle (they are much more expensive); OTOH, easily predicted branches are fairly cheap (or even get compiled into conditional moves here).
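
      I.e. something like this (sketch; assumes pos only ever advances by one element at a time):

          #include <stddef.h>

          static inline size_t advance(size_t pos, size_t bufSize) {
              pos += 1;
              if (pos >= bufSize)    /* predictable branch, often compiled to a cmov */
                  pos -= bufSize;
              return pos;
          }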