This patch from the article makes it take 1 cycle instead of 3:
- MOV RCX, 1
+ MOV ECX, 1
>It seems like SHLX performs differently depending on how the shift count register is initialized. If you use a 64-bit instruction with an immediate, performance is slow. This is also true for instructions like INC (which is similar to ADD with a 1 immediate).Practically speaking, is this sort of µop-dependent optimization implemented by compilers? How do they do so?
I suspect this specific thing is not; I don't think many people were aware of this anomaly until this week. I doubt the compilers are modeling it.
But in general, compilers will have a machine model that predicts the cost of different instructions. The backend will try to generate instruction sequences with low cost, so if you include this anomaly in the model it will probably work around it automatically.
This old example might interest you: a false dependency on the destination register for `popcnt` on some uarches led to compiler patches: https://stackoverflow.com/q/25078285/502399
Scheduling Model in LLVM - Part I https://myhsu.xyz/llvm-sched-model-1
You might have to look into tickless kernels too, but I think that's pretty mainstream now. Tldr, if there's no contention for the core and there's no hardware interrupts, there's no need for a timer/scheduler tick on that core.
I wonder whether IRET fixes it. Can you try MOV RCX, 1; push an IRET frame; IRET; SHLX to see if IRET fixes it?
No, if you push and pop RCX it's fast again. I mentioned this in the blog post, `mov rcx, 1` is slow but `mov rcx, 1; push rcx; pop rcx` is fast.
It is very strange to me that the instruction used to set the shift count register can make the SHLX instruction 3× slower.
I suspect this is a width restriction in the bypass/forwarding network.
The 32-bit vs. 64-bit operand size distinction is especially surprising to me as SHLX only looks at the bottom 6 bits of the shift count.
Unfortunately the dependency analysis circuitry seems not Intel-ligent enough to make that distinction.
Fast. But inc rcx is slow.
> Or the even shorter "mov cl, 1"?
Fast.
> I suspect this is a width restriction in the bypass/forwarding network...
I think we just found the explanation on Twitter: https://x.com/corsix/status/1874965887108976858
Alder Lake adds support for mov/add/sub with small immediates to the register renamer. So `add rcx, 1` gets handled by the renamer, potentially with zero latency.
Unfortunately shifts are slow when the shift count is renamed in this way. This leads to fun things like `mov rcx, 1024` being fast while `mov rcx, 1023` is slow. I'll update the blog post in a bit.
1. Assume that some values can be treated internally as "physical register plus a small immediate"
2. Assume that, in some cases, a physical register is known to be zero and the value can be represented as "zero plus a small immediate" (without reference to a physical register?)
3. Since 'count' is always expected to be <64, you technically don't need to use a physical register for it (since the immediate bits can always just be carried around with the name).
4. The 3-cycle case occurs when the physical register read cannot be optimized away??
(Or maybe it's just that the whole shift operation can occur at rename in the fast case??)
No... the 3-cycle case seems to be when the physical register read is optimized away. I think it's some kind of stupid implementation bug.
Or maybe the bypass network is bugged, and what we are seeing is a chicken bit set by the microcode that disables the bypass network for this one micro op that does shift.
Maybe if they really rely on this kind of forwarding in many cases, it's not unreasonable to expect that latency can be generated by having to recover from "incorrect PRF read" (like I imagine there's also a case for recovery from "incorrect forwarding")
I know modern CPUs will sometimes schedule uops that consume the result of load instruction, with the assumption the load will hit L1 cache. If the load actually missed L1, it's not going to find out until that uop tries to read the value coming in from L1 over the bypass network. So that uop needs to be aborted and rescheduled later. And I assume this is generic enough to catch any "incorrect forwarding", because there are other variable length instructions (like division) that would benefit from this optimistic scheduling.
But my gut is to only have these checks on the bypass network, and only ever schedule PRF reads after you know the correct value has been stored.
Essentially, Intel have allocated a bunch of fake renaming registers to hold all small immediates. Like how MIPS had a zero register, Intel now have over 1000 constant registers. These registers don't really exist, the immediate is just reconstituted at the execution unit.
And I'm guessing these small immediates predate Golden Cove; It's probably how they have always represented shift by constant and any instructions with 8bit immediate (presumably at least as far back as Sandybridge). The only change is that now the renamer can now execute simple movs/adds.
So essentially these variable shifts are being converted to constant shifts on the fly. The problem is that there is no constant shift version of shlx, so it wasn't in the test suite. There is probably some kind of stupid implementation bug in the scheduler that simply wasn't exposed until the renamer changes.
Also the problem affects SHL as well as SHLX, I didn't realize until just now.
That requires a whole bunch of extra 64-bit full-adders everywhere one of these pairs might be consumed (so realistically, on every single read port of the register file). 64-bit adders take quite a bit of latency, so you don't want extra adders on all your critical paths.
In the case where it appears to be holding a reg + offset pair, what I think has actually happened is that renamer (and/or uop fusion) has rewritten the uop to a 3-input add, with the offset as the third input.
> Also the problem affects SHL as well as SHLX, I didn't realize until just now.
And presumably SHR/SHRX/SAR/SARX too?
1. Start PRF reads at dispatch/rename
2. In the next cycle, you have the result and compute the 64-bit result. At the same time, the scheduler is sending other operands [not subject to the optimization] to read ports.
3. In the next cycle, the results from both sets of operands are available
It also adds an extra cycle of scheduling latency, which may or may not be an issue (I really have no idea how far ahead these schedulers can schedule).
If you think about the "1024 constant registers" approach: It allows you to have a small 10bit adder in the renamer which handles long chains of mov/add/sub/inc/dec ops, as long as the chain stays in the range of 1024. This frees the main adders for bigger sums, or maybe you can power gate them.
And in the case where one of the inputs is larger than 1024, or it's value is unknown during renaming, the renamer can still merge two two-input adds into a single three input add uop.
And, presumably, the OP shift case here is in fact a case of there not being a built-in immediate adder and thus a need for fixup uops being inserted to materialize it?
In general I think people are overstating the delay of an additional 64-bit add on register file reads (though I'm not a hardware guy so maybe someone can correct me). There are carry-lookahead adders with log_2(n) == 6 gate delays. Carry-save adders might also be relevant to how they can do multiple dependent adds with 0c latency.
> And, presumably, the OP shift case here is in fact a case of there not being a built-in immediate adder and thus a need for fixup uops being inserted to materialize it?
No, the perf counters show 1 uop dispatched/retired in both the slow and fast cases.
Unfortunately shifts are slow when the shift count is renamed in this way.
That's because the value still has to make it to the shifter... and the renamer's adder output clearly takes a different and higher-latency path than the one which doesn't go through it, confirming my original suspicion that this is a limitation of the bypass/forwarding network.
Ultimately it seems they are just playing around with where things happen; the addition can be done here but the value ultimately needs to go there for the shift, or the addition and shift can both happen there. One could envision a future change that puts the shifter in the renamer too. They're really constrained by the laws of physics.
This leads to all sorts of strange performance workarounds. Mike Acton refers to it here: https://macton.smugmug.com/Other/2008-07-15-by-Eye-Fi/n-xmKD...
Even some assemblers will optimize this for you.
Given that 64-bit immediate math also causes this though, compilers might generate it (I think this is a minor missed optimization by gcc though: it could have used add eax, 1 instead.
edit: It looks like VEX.W is set in the encoding from the uops.info tests ¯\_(ツ)_/¯
Has anyone run tests with ≥3 interleaved SHLX dependency chains in a loop? Does it "just" have 3 cycle latency or also less than 1 operation/cycle sustained throughput? Because if the pipeline stalls that would be even more annoying for existing optimised code.
Is the regression limited to only Alder Lake P-cores or also present it later (refreshed) cores?
It should be possible to construct scenarios where a logical register is mapped to "physical register + immediate" where the immediate depends on the path taken through the program so far. The easiest example would be an if-statement that optionally adds 1023 to a register, and then an add of 1 after the if-statement. The add of 1 overflows the immediate stored in the register renamer if and only if the if-condition was true.
As you said, that requires taking different control flow paths through the program, which is something to be avoided by constant-time algorithms anyway.
One potential twist: This shows that it may be important that the interface between the constant-time algorithm and the rest of the program goes entirely through memory and not through registers. If the interface does go through registers, it's possible that variable register-renamer-immediates leak from the outside world into the constant-time algorithm. (This admittedly feels like a bit of a stretch, but who knows.)
so today it needs to go through SSD or maybe the network LOL. but for real conspiracy, we should use paper and pencil
- 64 bit register
0: 48 c7 c1 01 00 00 00 mov rcx,0x1
- 32 bit register
0: b9 01 00 00 00 mov ecx,0x1
It should be easy to test by adding a couple of NOPs to the fast version: 0: b9 01 00 00 00 mov ecx,0x1
5: 90 nop
6: 90 nop
and see if it regresses again.I don't have an Alder Lake to test on.