For example, from what I understand, allocators operate on virtual memory, so when e.g. a vector grows past its capacity, memory has to be reallocated and the existing contents copied. Wouldn't it be possible to perform that copy by updating virtual addresses in the page table, with no copy actually being performed in physical memory?
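For what it's worth, when the backing storage is its own mmap'd region (rather than carved out of a malloc heap), Linux already exposes roughly this via mremap(): the kernel can grow or relocate the mapping by rewriting page table entries instead of copying the data. A minimal, Linux-specific sketch of what I mean:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t old_size = 1 << 20;          /* 1 MiB */
    size_t new_size = 1 << 24;          /* 16 MiB */

    /* Anonymous mapping standing in for a vector's backing storage. */
    char *buf = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    buf[0] = 'x';

    /* Grow the mapping; if it can't grow in place, the kernel relocates it
     * by moving page table entries, so the contents are not copied. */
    char *grown = mremap(buf, old_size, new_size, MREMAP_MAYMOVE);
    if (grown == MAP_FAILED) { perror("mremap"); return 1; }

    printf("still there: %c (old %p, new %p)\n", grown[0],
           (void *)buf, (void *)grown);
    munmap(grown, new_size);
    return 0;
}
```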
https://arxiv.org/abs/1902.04738
The presentation: https://youtu.be/c1UBJbfR-H0
If your system doesn't do overcommitting, then I guess that with some mmap() trickery (using the appropriate flags) you could do more or less the same thing, reserving the virtual address space that you need and then actually backing it with physical memory as you need it.
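Something like the following is a rough sketch of that reserve-then-commit pattern (Linux-flavoured; the exact flags, e.g. MAP_NORESERVE, vary per OS):

```c
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    size_t reserve = (size_t)1 << 30;   /* reserve 1 GiB of address space */
    size_t commit  = 1 << 20;           /* but only back 1 MiB up front */

    /* Reserve: no access rights, so nothing needs physical backing yet. */
    char *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Commit: flip the first chunk to read/write; the address stays put,
     * so nothing already stored there ever needs to be copied. */
    if (mprotect(base, commit, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect"); return 1;
    }
    memset(base, 0xab, commit);

    /* Later, "grow" by committing the next chunk in place. */
    if (mprotect(base + commit, commit, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect"); return 1;
    }

    printf("grew in place at %p\n", (void *)base);
    munmap(base, reserve);
    return 0;
}
```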
However, this gets tricky with heap allocators, since they often pack `malloc`-like allocations into the same page using some allocation scheme (e.g. buddy allocation, etc.). You can typically bypass malloc-like, in-process allocators and handle page mappings yourself, though; it's just a little more cumbersome.
This is actually how "virtual allocation" (I think is the term, don't quote me) happens; if you `malloc` a very large space on many modern OSes, you get a pointer back and there is an allotment in the process's virtual address space registered, but no page tables are actually updated (the kernel can assume they're "not present", meaning that there is no read nor write availability; any memory operation with that range causes a page fault). Upon any page in that range being accessed for the first time, the page fault handler sees it's part of that otherwise very large segment and will allocate in physical memory to that virtual page, then return.
It allows for e.g. allocating a VERY large sparse array of items (even terabytes!) that might otherwise exceed the size of physical memory actually available to the system, with the assumption that it'll never actually be fully populated (or at least never populated across enough memory pages to exhaust 'real', physical memory).
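A quick, Linux-flavoured sketch that shows the behaviour (the 64 GiB figure and MAP_NORESERVE are just for illustration; whether the mmap succeeds depends on the system's overcommit settings):

```c
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Ask for far more than most machines have as RAM. Only pages we
     * actually touch get faulted in and backed by physical memory. */
    size_t huge = (size_t)64 << 30;     /* 64 GiB of virtual space */
    char *arr = mmap(NULL, huge, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (arr == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch one byte every 1 GiB: ~64 page faults, ~64 physical pages. */
    for (size_t off = 0; off < huge; off += (size_t)1 << 30)
        arr[off] = 1;

    printf("sparse array at %p, pid %d (compare VIRT vs. RSS in top)\n",
           (void *)arr, (int)getpid());
    sleep(30);  /* leave time to inspect memory usage */
    munmap(arr, huge);
    return 0;
}
```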
This is also how file mapping works. You map a file from the filesystem, and the kernel keeps a cache of pages read from that file; memory reads and writes cause page faults if the pages haven't been read in (and cached) yet, prompting a read from disk into a physical page, an update to the page tables, and the faulting thread being resumed. It also allows the kernel to selectively reclaim less-frequently-used file-mapped pages for higher-priority allocations at pretty much any time, since the next time the program tries to access such a page, the kernel just faults again, reads it into a new page, and updates the tables. It's entirely transparent to the process.
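For completeness, a minimal sketch of the file-mapping flavour of this (standard POSIX mmap; no read() ever happens, pages are pulled in by faults as they're touched):

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "/etc/hostname"; /* any readable file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* No data is read here; the pages are filled in by page faults as we
     * actually touch them. */
    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    fwrite(data, 1, st.st_size, stdout);  /* first access -> fault -> disk/page cache */

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}
```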
I'm designing a novel kernel at the moment, and these sorts of possibilities are actually what I find to be the more interesting parts of the design. I find them to be somewhat underutilized even in modern kernels :)
A long, long, long time - especially on old big iron mainframes.
Here's a virtual ring buffer in C using mmap from 2002 written by a grey beard former mainframe coder, Philip Howard aka skapare:
https://web.archive.org/web/20100706051254/http://libh.slash...
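The modern Linux incarnation of the same trick looks roughly like this (a sketch, not skapare's actual code; memfd_create is Linux-specific, elsewhere you'd use shm_open or a temp file, and error handling is trimmed):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Map `size` bytes (a multiple of the page size) twice, back to back. */
static char *ring_create(size_t size) {
    int fd = memfd_create("ring", 0);            /* anonymous in-memory file */
    if (fd < 0 || ftruncate(fd, size) != 0) return NULL;

    /* Reserve 2*size of contiguous address space... */
    char *base = mmap(NULL, 2 * size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;

    /* ...then map the same file over both halves. */
    if (mmap(base, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) return NULL;
    if (mmap(base + size, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED) return NULL;

    close(fd);  /* the mappings keep the memory alive */
    return base;
}

int main(void) {
    size_t size = sysconf(_SC_PAGESIZE);
    char *ring = ring_create(size);
    if (!ring) { perror("ring_create"); return 1; }

    /* A write near the end of the first copy shows up at the start,
     * because both halves are the same physical pages. */
    memcpy(ring + size - 3, "wrap!", 5);
    printf("start of buffer: %.2s\n", ring);     /* prints "p!" */

    munmap(ring, 2 * size);
    return 0;
}
```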
Memory access -> TLB cache lookup -> PT lookup (if TLB miss) -> L1 cache check (depending on PT flags) -> L2 cache check (depending on PT flags, if L1 misses) -> ... -> main memory fetch, to boil it down simply.
CPUs would be ridiculously slow if that wasn't the case. Also upon thinking about it a bit more, I have no idea how it'd even work if it was the other way around. (EDIT: To be clear, I meant if main memory cache was hit first followed by the MMU - someone correctly mentioned VIVT caches which aren't what I meant :D)
But I've not designed CPUs nor do I work for $chip_manu so I'm speculating. Would love more info if anyone has it.
EDIT: Looks like some of the x86 manus have figured out a VIPT design that has fewer downsides and behaves more like a PIPT cache. I'd imagine the "how" is more of a trade secret though. I wonder what the ARM manus do, actually. Going to have to look it up :D
Nowadays you don't see it, precisely because of problems with aliasing. Hardware people would love to have these back, because having to do shared-index is what limits L1 cache today. I hope nobody actually does it, because this is a thing you can't really abstract away, and it interacts badly with applications that aren't aware of it.
Somewhat tangentially, this is also true for other virtual memory design choices, like page size (Apple silicon had problems with software that assumed 4096-byte pages). And I seriously wish CPU designers wouldn't be too creative with such hard-to-abstract things. Shaving some hundred transistors isn't really worth the eternal suffering of everyone who has to provide compatibility for it. Nowadays this is generally recognised (RISC-V was quite conscious about it). Pre-AMD64 systems like Itanium and MIPS were a total wild west about it.
Another example of a hard-to-abstract thing that is still ubiquitous is incoherent TLBs. It might have been the right choice back when SMP and multithreading were uncommon (a TLB flush on a single core is cheap), but it certainly isn't anymore, with IPIs being super expensive. The problem is that it directly affects how we write applications. Memory reclamation is so expensive it's not worth it, so nobody bothers. Passing buffers by virtual memory remapping is expensive, so we use memcpy everywhere. Which means it's hard to quantify the real-life benefit of TLB coherence, which makes it even more unlikely we'll ever get it.
> Passing buffers by virtual memory remapping is expensive, so we use memcpy everywhere.
Curious if you could expand on this a bit; memcpy still requires that two buffers are mapped in anyway. Do you mean that avoiding maps is more important than avoiding copies? Or is there something inherent about multiple linear addresses -> same physical address that is somehow slower on modern processors?
The CPU’s 26 bit address space was split into virtually mapped RAM in the bottom half, and the machine’s physical addresses in the top half. The physical address space had the RAM in the lowest addresses, with other stuff such as ROMs, memory-mapped IO, etc. higher up. The virtual memory hardware was pretty limited: it could not map a page more than once. But you could see both the virtually mapped and physically addressed versions of the same page.
RISC OS used this by placing the video memory in the lowest physical addresses in RAM, and also mapping it into the highest virtual addresses, so there were two copies of video memory next to each other in the middle of the 26 bit address space. The video hardware accessed memory using the same address space as the CPU, so it could do fast full-screen scrolling by adjusting the start address, using exactly the same trick as in the article.
The technique itself is indeed very elegant.
It seems to me that this benchmark conflates two optimizations that could be tackled separately.
For example, if you took the naive approach and just used a power-of-2 buffer size so that the modulo could be replaced with "&", you'd get a pretty good speedup without resorting to mmap, no?
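i.e. something along these lines (illustrative struct and field names, assuming the capacity is kept a power of two):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Capacity must be a power of two so the mask equals the modulo.
 * head/tail are free-running counters; only the masked value indexes buf. */
struct queue {
    unsigned char *buf;
    size_t capacity;   /* power of two */
    size_t head, tail;
};

static void push_byte(struct queue *q, unsigned char b) {
    assert((q->capacity & (q->capacity - 1)) == 0);
    q->buf[q->head & (q->capacity - 1)] = b;   /* instead of q->head % q->capacity */
    q->head++;
}

int main(void) {
    unsigned char storage[8] = {0};
    struct queue q = { storage, sizeof storage, 0, 0 };
    for (int i = 0; i < 10; i++)                /* wraps after 8 writes */
        push_byte(&q, (unsigned char)('a' + i));
    printf("%c %c\n", storage[0], storage[1]);  /* 'i' and 'j' overwrote 'a' and 'b' */
    return 0;
}
```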
Whatever you do, there's another advantage to the mmap trick: if the items are variable sizes, they never need to be split. (Split items are often a pain to deal with.) The circularity becomes the buffer's problem alone, and consumers don't have to handle it.
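Concretely, with a doubly-mapped buffer like the sketch further up the thread, a variable-sized write is always a single contiguous memcpy, even when it straddles the wrap point (hypothetical helper, bounds/fullness checks omitted):

```c
#include <string.h>

/* `ring` is the doubly-mapped region of `size` bytes from the earlier
 * sketch; `head` is a free-running write counter. No split handling:
 * the second mapping absorbs any wrap past the end of the first. */
static size_t ring_write(char *ring, size_t size, size_t head,
                         const void *msg, size_t len) {
    memcpy(ring + (head % size), msg, len);   /* one copy, always contiguous */
    return head + len;
}
```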
Another article about this here: https://fgiesen.wordpress.com/2012/07/21/the-magic-ring-buff... - see also the comments.
While updating it, I noticed a couple of things:
1. The Makefile doesn't specify -O, which seems to imply -O0, unreasonably flattering the 1-memcpy case because there's only 1 call to a generic memcpy rather than 2. With -O3 the memcpys get inlined.
2. I made the message size more annoying. The queue size was a multiple of the message size, so the 2-memcpy case was never exercised, and -O3 could see this.
(Change 2 actually slightly cancels out change 1, and now the 2-memcpy case is terrible again...)
Who even knows what sort of system the Compiler Explorer executor runs the programs on, but this is what I got from running it, and the relative speeds might be representative:
%: 0.033716 microseconds per write
&: 0.031772 microseconds per write
slow: 0.009715 microseconds per write
fast: 0.001035 microseconds per write