With more detailed power measurements, it might be possible to determine whether this is thermal/power-budget related. It does sound like the feature was intended to conserve power…
Most of the cores on CCD0 of my non-X3D chip hit 5.6-5.75 GHz. CCD1 has cores topping out at 5.4-5.5 GHz.
V-Cache chips for Zen 4 have a huge clock penalty; however, the extra cache more than makes up for it.
Did he test CCD1 on the same chip with both the feature disabled and enabled? Did he attempt to isolate other changes like security fixes as well? He admitted “no” in his article.
The only proper way to test would be to find a way to disable the feature on a BIOS that has it enabled and test both scenarios across the same chip, and even then the result may still not be accurate due to other possible branch conditions. A full performance profile could bring accuracy, but I suspect only an AMD engineer could do that…
"Both the fetch+decode and op cache pipelines can be active at the same time, and both feed into the in-order micro-op queue. Zen 4 could use its micro-op queue as a loop buffer, but Zen 5 does not. I asked why the loop buffer was gone in Zen 5 in side conversations. They quickly pointed out that the loop buffer wasn’t deleted. Rather, Zen 5’s frontend was a new design and the loop buffer never got added back. As to why, they said the loop buffer was primarily a power optimization. It could help IPC in some cases, but the primary goal was to let Zen 4 shut off much of the frontend in small loops. Adding any feature has an engineering cost, which has to be balanced against potential benefits. Just as with having dual decode clusters service a single thread, whether the loop buffer was worth engineer time was apparently “no”."
If so, it might be a classic case of "Team of engineers spent months working on new shiny feature which turned out to not actually have any benefit, but was shipped anyway, possibly so someone could save face".
I see this in software teams when someone suggests it's time to rewrite the codebase to get rid of legacy bloat and increase performance. Yet, when the project is done, there are more lines of code and performance is worse.
In both cases, the project shouldn't have shipped.
no. once the core has it and you realize it doesn't help much, it absolutely is a risk to remove it.
The ease of pushing updates encourages lazy coding.
Certainly in some cases, but in others, it just shifts the economics: obviously, fault tolerance can be laborious and time consuming, and that time and labor is taken from something else. When the nature of your dev and distribution pipelines renders faults less disruptive, and you have a good foundational codebase and a code review process that pays attention to security and core stability, quickly creating 3 working features can be much, much more valuable than making sure 1 working feature will never ever generate a support ticket.
They showed me the facilities, and the vast majority was taken up by testing and validation rigs. The sensors would go through many stages, taking several weeks.
The final stage had an adjacent room with a viewing window and a nice couch, so a representative for the client could watch the final tests before bringing the sensors back.
Quite the opposite to the "just publish a patch" mentality that's so prevalent these days.
Even for software it’s often risky to remove code once it’s in there. Lots of software products are shipped with tons of unused code and assets because no one’s got time to validate nothing’s gonna go wrong when you remove them. Check out some game teardowns, they often have dead assets from years ago, sometimes even completely unrelated things from the studio’s past projects.
Of course it’s 100x worse for hardware projects.
It was shipped anyway because it can be disabled with a firmware update, and because drastically altering physical hardware layouts mid-design was likely to have worse impacts.
Building chips is a multiyear process and most folks don’t understand this.
Tell that to the shareholders. As a public company, they can very quickly lose enormous amounts of money by being behind or below on just about anything.
Remember most of the technical analysis on Chips and Cheese is a one person effort, and I simply don't have infinite free time or equipment to dig deeper into power. That's why I wrote "Perhaps some more mainstream tech outlets will figure out AMD disabled the loop buffer at some point, and do testing that I personally lack the time and resources to carry out."
More so that estimating power when you don't have access to post synthesis simulations or internal gas gauges is very hard. For something so small, I can easily see this being a massive pain to measure in the field and the kind of thing that would easily vanish into the noise on a real system.
But in the absence of any clear answer, I do think it's reasonable to assume that the feature does in fact have the power advantages AMD intended, even if small.
This is often pretty common, as the performance characteristics are often unknown until late in the hardware design cycle - it would be "easy" if each cycle was just changing that single unit with everything else static, but that isn't the case as everything is changing around it. And then by the time you've got everything together complete enough to actually test end-to-end pipeline performance, removing things is often the riskier choice.
And that's before you even get to the point of low-level implementation/layout/node specific optimizations, which can then again have somewhat unexpected results on frequency and power metrics.
There is an added benefit though - that the new programmers now are fluent in the code base. That benefit might be worth more than LOCs or performance.
I wonder if that will be the key benefit of Google's switch to two "major" Android releases each year: it will get people used to nothing newsworthy happening within a version increment. And I also wonder if that's intentional, and my guess is not the tiniest bit.
It tests the performance benefit hypothesis in different scenarios and does not find evidence that supports it. It makes one best effort attempt to test the power benefit hypothesis and concludes it with: "Results make no sense."
I think the real take-away is that performance measurements without considering power tell only half the story. We came a long way when it comes to the performance measurement half but power measurement is still hard. We should work on that.
There will be a certain number of people who will delay an upgrade a bit more because the new machines don’t have enough extra oomph to warrant it. Little’s Law can apply to finance when the quantity in question is the interval between purchases.
Energy used per instruction is almost certainly the metric that should be considered to see the benefits of this loop buffer, not energy used per second (power, watts).
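To make that concrete, here's a back-of-the-envelope sketch. The 20 W and 5 GHz figures are made up for illustration; only the IPC values come from the article's Cyberpunk 2077 measurements. Identical power draw can hide a real difference in energy per instruction once you divide by retired instructions per second (frequency times IPC):

    #include <stdio.h>

    /* Back-of-the-envelope: the wattmeter reading is identical in both runs,
     * but energy per retired instruction differs once you divide power by
     * instructions per second (frequency * IPC). The 20 W / 5 GHz numbers
     * are hypothetical; the IPC values are the article's Cyberpunk figures. */
    int main(void) {
        const double watts   = 20.0;    /* hypothetical package power */
        const double freq_hz = 5.0e9;   /* hypothetical 5 GHz clock   */
        const double ipc_on  = 1.25;    /* loop buffer enabled        */
        const double ipc_off = 1.07;    /* loop buffer disabled       */

        printf("energy/instruction, on:  %.2f nJ\n", watts / (freq_hz * ipc_on)  * 1e9);
        printf("energy/instruction, off: %.2f nJ\n", watts / (freq_hz * ipc_off) * 1e9);
        return 0;
    }

Same watts either way, but the loop-buffer-on case gets more work done per joule (~3.2 nJ vs ~3.7 nJ per instruction with these made-up numbers).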
While you can somewhat control for this by doing hundreds of runs with the feature both on and off, that takes tons of time and still won’t be 100% accurate.
Even disabling the feature can cause the code to use a different branch which may shift everything around.
I am not specifically familiar with this issue, but I have seen cases where disabling a feature shifted the load from integer units to the FPU or the GPU as an example, or added 2 additional instructions while taking away 5.
That's not quite how it was implemented.
Instead, the second 68000 was halted and disconnected from the bus until the first 68000 (the executor) triggered a fault. Then the first 68000 would be held in halt and disconnected from the bus, and the second 68000 (the fixer) would take over the bus to run the fault handler code.
After the fault had been handled, the first 68000 could be released from halt and it would resume execution of the instruction, with all state intact.
As for the cost of a second 68000, extra logic and larger PCBs? Well, the cost of the Motorola 68451 MMU (or equivalent) absolutely dwarfed the cost of everything else, so adding a second CPU really wasn't a big deal.
Technically it didn't need to be another 68000, any CPU would do. But it's simpler to use a single ISA.
For more details, see Motorola's application note here: http://marc.retronik.fr/motorola/68K/68000/Application%20Not...
While this executor + fixer setup does work for most use cases, it's still impossible to recover the state: the relevant state is simply held inside the halted 68000.
Which means, the only thing you can do is handle the fault and resume. If you need to page something in from disk, userspace is entirely blocked until the IO request completes. You can't go and run another process that isn't waiting for IO.
I suspect it also makes it impossible to correctly implement POSIX segfault signal handlers. If you try to run the handler on the executor, then the state is cleared and it's not valid to return from the signal handler anymore.
If you run the handler on the fixer instead, then you are running in a context without page faults, which would be disastrous if the segfault handler accesses code or data that has been paged out. And the segfault handler wouldn't have access to any of the executor CPU's state.
------
So there is merit to the idea of running two 68000s in lockstep. That would theoretically allow you to recover the full state.
But there is a problem: It's not enough to run the second 68000 one cycle behind.
You need to run it one instruction behind, putting all memory read data and wait-states into a FIFO for the second 68000 to consume. And 68000 instructions have variable execution time, so I guess the delay needs to be the length of the longest possible instruction (which is something like 60 cycles).
But what about pipelining? That's the whole reason why we can't recover the state in the first place. I'm not sure, but it might be necessary to run an entire 4 instructions behind, which would mean something like 240 cycles buffered in that FIFO.
This also means your fault handler is now running way too soon. You will need to emulate 240 cycles worth of instructions in software until you find the one which triggered the page fault.
I think such an approach is possible, but it really doesn't seem sane.
--------
I might need to do a deeper dive into this later, but I suspect all these early dual 68000 Unix workstations simply dealt with the issues of the executor/fixer setup and didn't implement proper segfault signal handlers. It's reasonably rare for programs to do anything in a segfault handler other than print a nice crash message.
Any unix program that did fancy things in segfault handlers wasn't portable anyway, as many unix systems didn't have paging at all. It was enough to have a memory mapper with a few segments (base, size, and physical offset).
Could anyone do any better on the 68000? My incomplete history of CPU-dedicated fast paths for moving data (see the REP MOVSB sketch after the list):
- 1982 Intel 186/286 'rep movsw' at a theoretical 2 cycles per byte (I think it's closer to 4 in practice). Brilliant, then Intel drops the ball for 20 years :|
- 1986 WDC W65C816 Move Memory Negative (MVN), Move Memory Positive (MVP) at a hilarious 7 cycles per byte. Slower than unrolled code, 2x slower than unrolled code using zero page. AFAIK no loop buffer meant it's re-fetching the whole instruction every loop.
- 1987 NEC TurboGrafx-16/PC Engine 6502 clone by HudsonSoft, the HuC6280: Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII) at a hysterical 6 cycles per byte plus 17 cycles startup. (17 + 6x) = ~160 KB/s at a 7.16 MHz CPU. For comparison, an IBM XT with a 4.77 MHz NEC V20 does >300 KB/s.
- 1993 Pentium 'rep movsd' at theoretical 4 bytes per cycle, 0.31 cycles per byte in practice http://www.pennelynn.com/Documents/CUJ/HTML/14.12/DURHAM1/DU...
- 1995 Pentium Pro "fast string mode" strongly hinted at REP MOVS as the optimal way to copy memory.
- 1997 Pentium MMX 'rep movsd' 0.27 cycles per byte. Mem copy with MMX registers 0.29 cycles per byte.
- 2000 SSE2 optimized copy hack.
- 2008 AVX optimized copy hack at ~full L2/memory bus speed for large enough transfers.
- 2012 Ivy Bridge Enhanced REP MOVSB (ERMSB), but funnily still slower than even the SSE2 variants.
- 2019 Ice Lake Fast Short REP MOVSB (FSRM) still somewhat slower than AVX variants on unaligned accesses.
- 2020 Zen3 FSRM !20 times! slower than AVX unaligned, 30% slower on aligned https://lunnova.dev/articles/ryzen-slow-short-rep-mov/
- 2023 And then Intel got Reptar https://lock.cmpxchg8b.com/reptar.html :)
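For reference, this is roughly what the REP MOVSB path looks like from C today: a minimal sketch using GCC-style inline asm on x86-64, not tuned, just showing the mechanism the ERMSB/FSRM entries above refer to. Libc implementations typically choose between this and SIMD loops based on CPU features and copy size.

    #include <stddef.h>

    /* Minimal sketch: copy n bytes with REP MOVSB (x86-64, GCC-style inline asm).
     * RDI = destination, RSI = source, RCX = byte count. The CPU's fast-string
     * logic (ERMSB/FSRM, where present) decides how wide the internal moves
     * actually are. Assumes the direction flag is clear, as the SysV ABI
     * guarantees on function entry. */
    static void copy_rep_movsb(void *dst, const void *src, size_t n) {
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
    }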
That being said, some workloads will see a small regression; however, AMD has made some small performance improvements since launch.
They should have just made it a BIOS option for Zen 4. The fact they do not appear to have done so does indicate the possibility of a bug or security issue.
If you don't disclose the vulnerability then affected parties cannot start taking countermeasures, except out of sheer paranoia.
Disclosing a vulnerability is a way to shift liability onto the end user. You didn't update? Then don't complain. Only rarely do disclosures lead to product liability. I don't remember this (liability) happening with Meltdown and Spectre either. So I wouldn't assume this is AMD being secretive.
I can't say more. :(
This should be fun, however, for someone with enough time to chase down and try and find the bug. Depending on the consequences of the bug and the conditions under which it hits, maybe you could even write an exploit (either going from JavaScript to the browser or from user mode to the kernel) with it :) Though, I strongly suspect that reverse engineering and weaponizing the bug without any insider knowledge will be exceedingly difficult. And, anyways, there's also a decent chance this issue just leads to a hang/livelock/MCE which would make it pointless to exploit.
Or, framed differently, if Intel or AMD announced a new gamer CPU tomorrow that was 3x faster in most games but utterly unsafe against all Meltdown/Spectre-class vulns, how fast do you think they'd sell out?
Can you elaborate on that? It sounds interesting
I was always morbidly curious about programming those, but never to the point of actually buying one, and in a past life when we had a few of the cards in my office, I always had more things to do in the day than time.
Given that we have effectively two browser platforms (Chromium and Firefox) and two operating systems to contend with (Linux and Windows), it seems entirely tractable to get the security sensitive threads scheduled to the "S cores".
So I think it's the JavaScript that should run on these hypothetical cores.
Though perhaps a few other operations might choose to use them as well.
I think we're headed towards the future of many highly insulated computing nodes that share little if anything. Maybe they'd have a faster way to communicate, e.g. by remapping fast cache-like memory between cores, but that memory would never be uncontrollably shared the way cache lines are now.
Well, many people have gaming computers they won't use for anything serious. So I would also buy it. And on restricted gaming consoles, I suppose the risk is not too high?
To mitigate it, browsers did a bunch of hacks, including nerfing precision on all timer APIs and disabling shared memory, because you need an accurate timer for the exploit - to this day performance.now() rounds to 1 ms on Firefox and 0.1 ms on Chrome.
This 1 ms rounding is funnily enough a headache for me right as we speak. On, say, a 240 Hz monitor, for video games you need to render a frame every ~4.16 ms -- 1 ms precision is not enough for an accurate ticker -- even if you render your frames on time, the result can't be perfectly smooth because the browser doesn't give an accurate enough timer by which to advance your physics every frame.
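A quick simulation of what that quantization does to a 240 Hz frame ticker (C here just for illustration, with floor() standing in for the browser's rounding):

    #include <math.h>
    #include <stdio.h>

    /* Illustrative only: real frames arrive every 1000/240 ~= 4.167 ms, but a
     * timestamp quantized to 1 ms (floor() stands in for the browser's
     * rounding) reports jittery deltas, so physics stepped off the reported
     * deltas can't be perfectly smooth. */
    int main(void) {
        const double frame_ms = 1000.0 / 240.0;
        double t = 0.0, prev_reported = 0.0;
        for (int frame = 1; frame <= 8; frame++) {
            t += frame_ms;
            double reported = floor(t);  /* 1 ms resolution timestamp */
            printf("frame %d: real dt %.3f ms, reported dt %.0f ms\n",
                   frame, frame_ms, reported - prev_reported);
            prev_reported = reported;
        }
        return 0;
    }

The real frame time never changes, but the deltas you can actually observe wobble between 4 ms and 5 ms, and that jitter ends up in the physics step.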
[1]:https://www.club386.com/assassins-creed-shadows-drm-wants-to...
10-20 min, depending on how many they make :)
Which makes it kind of terrible that the kernel has these mitigations turned on by default, stealing somewhere in the neighborhood of 20-60% of performance on older gen hardware, just because the kernel has to roll with "one size fits all" defaults.
If you don't know what kernel parameters are and what they affect, it's likely safer to go with all the mitigations enabled by default :-|
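If you do want to see what those defaults actually enabled on your machine before changing anything, the kernel reports per-vulnerability status under /sys/devices/system/cpu/vulnerabilities. A minimal read-only sketch (Linux-specific; a plain grep over those files shows the same information):

    #include <dirent.h>
    #include <stdio.h>

    /* Print the kernel's reported mitigation status for each known CPU
     * vulnerability (Linux sysfs). Read-only; changes no settings. */
    int main(void) {
        const char *dir = "/sys/devices/system/cpu/vulnerabilities";
        DIR *d = opendir(dir);
        if (!d) { perror(dir); return 1; }
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (e->d_name[0] == '.') continue;
            char path[512], line[256];
            snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
            FILE *f = fopen(path, "r");
            if (!f) continue;
            if (fgets(line, sizeof line, f))
                printf("%-28s %s", e->d_name, line);
            fclose(f);
        }
        closedir(d);
        return 0;
    }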
Appreciate sharing the gist though!
I think things will only shift once we have systems that ship with full sandboxes that are minimally optimized and fully isolated. Until then we are forced to assume the worst.
The problem is that you need to execute on the system, then need to know which application you’re targeting, then figure out the timings, and even then you’re not certain you are getting the bits you want.
Enabling mitigations for servers? Sure. Cloud servers? Definitely. High-profile targets? Go for it.
The current defaults are like foisting iOS's “Lockdown Mode” on all users by default and then expecting them to figure out how to turn it off, except you have to do it by connecting the phone to your Mac/PC and punching in a bunch of terminal commands.
Then again, almost all kernel settings are server-optimal (and even then, 90s server optimal). There honestly should be some serious effort to modernize the defaults for reasonably modern servers, and then also have a separate kernel for desktops (akin to CachyOS, just more upstream).
These defaults are needed and if the loss is so massive we should be willing to embrace less programmable but more secure options.
A SINGLE thread's best case and worst case have to be the same to avoid speculation...
However, threads from completely unrelated domains could be run instead, if ready. Most likely the 'next' thread on the same unit, and worry about repacking free slots the next time the scheduler runs.
++ Added ++
It might be possible to let operations that don't cross security boundaries (operations within a program's own space) have different performance characteristics.
An 'enhanced' level of protection might also be offered for threads running VM-like guest code segments (such as browsers), one that avoids the more aggressive speculation operations.
Any operation similar to a segmentation fault relative to that thread's allowed memory accesses could result in forfeiting its timeslice. That would only leak what it should already know anyway: what memory it's allowed to access. Not the contents of other memory segments.
This introduces HT side-channel vulnerabilities. You would have to statically partition all caches and branch predictors.
Also this is more or less how GPUs work. It is great for high throughput code, terrible for latency sensitive code.
I do realize that gamers aren't the most logical bunch, but aren't most games GPU-bound nowadays?
So as long as stuff is not perfectly isolated from everything else, there's always room for a bad actor to snoop on stuff.
"Attacks only get better."
> Zen 4 is AMD's first attempt at putting a loop buffer into a high performance CPU. Validation is always difficult, especially when implementing a feature for the first time. It's not crazy to imagine that AMD internally discovered a bug that no one else hit, and decided to turn off the loop buffer out of an abundance of caution. I can't think of any other reason AMD would mess with Zen 4's frontend this far into the core's lifecycle.
I have enough trouble if someone responds to my responses in a tone similar to GP and I end up treating them like the same person (e.g., GP makes a jab and now I'm snarky or call out the wrong person). Especially if I have to step away to deal with life.
> Still, the Cyberpunk 2077 data bothers me. Performance counters also indicate higher average IPC with the loop buffer enabled when the game is running on the VCache die. Specifically, it averages 1.25 IPC with the loop buffer on, and 1.07 IPC with the loop buffer disabled. And, there is a tiny performance dip on the new BIOS.
Smells of microcode mitigations if you ask me, but naturally let’s wait for the CVE.
( Idly waiting for x86 to try and compete with ARM on efficiency. Unfortunately I don't see Zen 6 or Panther Lake getting close. )
But when the CPU is pulling 100 W under load? Well, now we're talking an amount so small it's irrelevant. Maybe with a well-calibrated scope you could figure out if it was on or not.
Since this is in the micro-op queue in the front end, it's going to be more about that very low total power draw side of things where this comes into play. So this would have been something they were doing to see if it helped for the laptop SKUs, not for the desktop ones.
It's very very expensive to fix a bug in a CPU, so it's easier to expose control flags or microcode so you can patch it out.
https://www.ardent-tool.com/CPU/Cyrix_Cx486.html#soft
https://www.vogons.org/viewtopic.php?t=45756 Register settings for various CPUs
https://www.vogons.org/viewtopic.php?t=30607 Cyrix 5x86 Register Enhancements Revealed
L1, Branch Target Buffer, LSSER (load/store reordering), Loop Buffer, Memory Type Range Registers (Write Combining, Cacheability), all controlled using client side software.
Cyrix 5x86 testing of Loop Buffer showed 0.2% average boost and 2.7% maximum observable speed boost.
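If anyone wants to poke at those registers themselves: Cyrix exposes its configuration registers through the index/data port pair 0x22/0x23 (write the register index to 0x22, then read or write the value at 0x23). A minimal sketch, assuming root on x86 Linux with glibc's sys/io.h; CCR4 (index 0xE8) is just an example register, see the vogons threads above for which bits gate which feature:

    #include <stdio.h>
    #include <sys/io.h>   /* ioperm, inb, outb (x86 Linux, glibc) */

    /* Read a Cyrix configuration register via the 0x22/0x23 index/data ports.
     * CCR4 (index 0xE8) is used purely as an example; each data access must be
     * preceded by an index write to 0x22. */
    int main(void) {
        if (ioperm(0x22, 2, 1) != 0) { perror("ioperm (need root)"); return 1; }

        outb(0xE8, 0x22);                /* select CCR4 */
        unsigned char ccr4 = inb(0x23);  /* read its current value */
        printf("CCR4 = 0x%02X\n", ccr4);

        /* To change a bit, re-select the index before the data write:
         * outb(0xE8, 0x22); outb(ccr4 | SOME_BIT, 0x23);   (SOME_BIT hypothetical) */
        return 0;
    }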
I guess this could also be used as an optimization target at least on devices that are more long lived designs (eg consoles).
> Perhaps the best way of looking at Zen 4's loop buffer is that it signals the company has engineering bandwidth to go try things. Maybe it didn't go anywhere this time. But letting engineers experiment with a low risk, low impact feature is a great way to build confidence. I look forward to seeing more of that confidence in the future.