An all-around winner at $500 with only 8 cores makes no sense.
This thing has a premium gaming price tag because there is nothing close to it other than their own 7800X3D.
It is pretty competitive on the Multi-Core rating: https://browser.geekbench.com/v6/cpu/8633320 compared to other CPUs: https://browser.geekbench.com/processor-benchmarks
Fewer cache misses (on popular workloads) help decrease power and increase performance enough that few things benefit from 12-16 cores.
Thus the M3 Max (with a 512-bit-wide memory system) has class-leading single-core and multi-core scores.
The Apple chips are APUs and need a lot of their memory bandwidth for the GPU. Are there any good resources on how much of this bandwidth is actually used in common CPU workloads? Can the CPU even max out half of the 512-bit bus?
[1] https://www.computerbase.de/artikel/prozessoren/amd-ryzen-7-...
The higher CCD configurations have 1 IF link per chip, the lower ones have 2 IF links per chip. Presumably AMD wouldn't bother with the 2 IF link chips unless it helped.
That said, each link gives a CCD 64 GB/s of read speed and 32 GB/s of write speed. 8000 MT/s memory at 128 bits would get up to 128 GB/s. So being stuck with one link would bottleneck badly enough to hide the effects of memory speed.
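Rough numbers, if it helps (a minimal sketch using the per-link figures above; treat them as approximate, not spec-sheet values):

    # Back-of-envelope: one GMI/IF link vs. a 128-bit DDR5 interface.
    # Per-link figures are the approximate ones quoted above, not vendor specs.
    GMI_READ_GBPS = 64   # per-link read bandwidth (approx.)
    GMI_WRITE_GBPS = 32  # per-link write bandwidth (approx.)

    def dram_peak_gbps(mt_per_s: float, bus_width_bits: int) -> float:
        """Peak DRAM bandwidth: transfers per second times bytes per transfer."""
        return mt_per_s * (bus_width_bits / 8) / 1000  # MT/s * bytes -> GB/s

    dram = dram_peak_gbps(8000, 128)  # dual-channel DDR5-8000
    print(f"DRAM peak:          {dram:.0f} GB/s")           # ~128 GB/s
    print(f"1 GMI link, reads:  {GMI_READ_GBPS} GB/s")      # half the DRAM peak
    print(f"2 GMI links, reads: {2 * GMI_READ_GBPS} GB/s")  # matches the DRAM peak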
It doesn't make much difference to most apps, but I believe the single CCD parts (like the 9700X) have better bandwidth to the IOD than the dual CCD chips, like the 9900X and 9950X.
Similarly, on the server chips you can get 2, 4, 8, or 16 CCDs. To get 16 cores you can use 2 CCDs or 16 CCDs! But the sweet spot (max bandwidth per CCD) is at 8 CCDs, where you get a decent number of cores and twice the bandwidth per CCD. Keep in mind the Genoa/Turin EPYC chips have 24 channels (24 x 32-bit) for a 768-bit-wide memory interface, so they're not nearly as constrained as the desktop parts.
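To make the sweet-spot argument concrete, here's a small sketch. The ~64 GB/s per-link figure is borrowed from the comment above, and the "GMI-Wide on 8 or fewer CCDs" cutoff is just the claim here, so treat both as assumptions rather than vendor specs:

    # Illustration of the bandwidth-per-CCD sweet spot described above.
    # Assumptions, not vendor specs: each GMI link carries ~64 GB/s of reads,
    # and SKUs with 8 or fewer CCDs get GMI-Wide (2 links per CCD).
    GMI_READ_GBPS = 64
    DDR5_MTS = 4800        # Genoa-era figure; Turin runs faster
    BUS_WIDTH_BITS = 768   # 24 x 32-bit channels, as described above

    dram_total = DDR5_MTS * BUS_WIDTH_BITS / 8 / 1000  # ~461 GB/s

    for ccds in (2, 4, 8, 16):
        links = 2 if ccds <= 8 else 1         # assumed GMI-Wide cutoff
        link_limit = links * GMI_READ_GBPS    # what one CCD's links can carry
        dram_share = dram_total / ccds        # even split of DRAM bandwidth
        print(f"{ccds:2d} CCDs: {links} link(s)/CCD, "
              f"link limit {link_limit} GB/s, DRAM share {dram_share:.0f} GB/s")

Under those assumptions, 8 CCDs is the largest configuration that still gets the doubled per-CCD link bandwidth, which is the sweet spot being described.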
Wish I could paste in a diagram, but check out:
https://www.amd.com/content/dam/amd/en/documents/epyc-techni...
Page 7 has a diagram of a 96-core chip with one GMI (IF) port per CCD and a 32-core chip with two GMI ports per CCD.
That's a generation old, I believe; with Turin the max CCD count is now 16, not 12.
some diagrams: https://www.servethehome.com/amd-epyc-genoa-gaps-intel-xeon-...
From another page: "The most noteworthy aspect is that there is a new GMI3-Wide format. With Client Zen 4 and previous generations of Zen chiplets, there was 1 GMI link between the IOD and CCD. With Genoa, in the lower core count, lower CCD SKUs, multiple GMI links can be connected to the CCD."
And it seems like all the chiplets have two links, but everything I can find says they just don't hook up both on consumer parts.
I dug around a bit, and it seems Ryzen doesn't get it. I guess that makes sense, if the IOD on Ryzen gets 2 GMI links: on the single CCD parts there's no other CCD to talk to, and on the dual CCD parts there aren't enough GMI links to have both in GMI-Wide.
Maybe this will be different on the pending Zen 5 part (Strix Halo), which will have a 256-bit-wide (16 x 32-bit) interface @ 8533 MT/s ≈ 273 GB/s, since there will be 2 CCDs and a significant bump to memory bandwidth.
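Quick arithmetic on that rumored configuration (the numbers are the rumored ones above, not confirmed specs, and the even split between CCDs is only for illustration):

    # Rumored Strix Halo memory setup from above (not a confirmed spec).
    mts = 8533       # LPDDR5X transfer rate (rumor)
    bus_bits = 256   # 16 x 32-bit channels
    ccds = 2

    total_gbps = mts * bus_bits / 8 / 1000
    print(f"total:   {total_gbps:.0f} GB/s")         # ~273 GB/s
    print(f"per CCD: {total_gbps / ccds:.0f} GB/s")  # ~137 GB/s if split evenly
    # A single ~64 GB/s GMI link per CCD could not absorb that share,
    # which is why wider links per CCD would matter here.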
Consumer platforms do NOT do this; this has actually been discussed in depth in the Threadripper Pro space. The low CCD parts were hamstrung by the shortage of IF links, meaning that they got a far smaller bump from more than 4 channels of populated RAM than they could have.
Generally CPUs have relatively small reorder windows, so a cache miss hurts badly: 80 ns of latency @ 5 GHz is 400 clock cycles, and something north of 1600 instructions that could have been executed. If one in 20 operations is a cache miss, that's a serious impediment to getting any decent fraction of peak performance. The pain of those cache misses is part of why the X3D does so well; even a modest reduction in cache misses can increase performance a fair bit.
With 8c/16t, having only 2 (DDR4) or 4 (DDR5) cache misses pending on a 128-bit-wide system means that in any given 80-100 ns window only 2 or 4 cores can resume after a cache miss. DDR5-6000 vs DDR5-7800 doesn't change that much: you still wait the 80-100 ns, you just get the cache line in 8 transfers (16 for DDR5) @ 7800 MT/s instead of 8 transfers (16 for DDR5) @ 6000 MT/s. So the faster DDR5 means more bandwidth (good for GPUs), but not more cache transactions in flight (good for CPUs).
With better memory systems (like the Apple M3 Max) you could have 32 cache misses per 80-100 ns window. I believe about half of those are reserved for the GPU, but even 16 would mean that all of the 9800X3D's 16 threads could resolve a cache miss per 80-100 ns instead of just 2 or 4.
That's part of why an M4 Max does so well on multithreaded code. The M4 Max does better on Geekbench 6 multithread than not only the 9800X3D (with 16 threads) but also a 9950X (with 16c/32t). Pretty impressive for a low-TDP chip that fits in a thin-and-light laptop with great battery life and competes well against Zen 5 chips with a 170 W TDP that often use water cooling.
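A back-of-envelope version of the misses-in-flight argument above, as a sketch under the simplifying assumption that independent DRAM (sub)channels cap the number of misses outstanding (real controllers can keep more requests in flight via banks, which is what the bank/bank-group question below is about):

    # Cache-line fills per memory-latency window, under the simplifying
    # assumption above that independent DRAM (sub)channels cap how many
    # misses can be outstanding at once.
    LATENCY_NS = 90   # ~80-100 ns main-memory latency
    LINE_BYTES = 64   # cache line size

    def latency_bound_gbps(channels: int) -> float:
        misses_per_second = channels / (LATENCY_NS * 1e-9)
        return misses_per_second * LINE_BYTES / 1e9

    for name, channels in [("DDR4, 2 x 64-bit channels", 2),
                           ("DDR5, 4 x 32-bit subchannels", 4),
                           ("wide LPDDR5 (M3 Max-like), 16 x 32-bit", 16)]:
        print(f"{name}: {channels} misses in flight, "
              f"~{latency_bound_gbps(channels):.1f} GB/s of latency-bound fills")

    # Also: 90 ns at 5 GHz is ~450 stalled cycles per miss; at ~4 instructions
    # per cycle that's roughly 1800 instructions of lost opportunity.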
Isn't that the purpose of banks and bank groups, letting a bunch of independent requests work in parallel on the same channel?
There are minor tweaks; I believe you can send a row and column, then on future accesses send only the column. There are also slight differences in memory pages (a DIMM page != a kernel page) that decrease latency with locality. But the differences are minor and don't really move the needle on a main memory latency of ~60 ns (not including the L1/L2/L3 latencies, which have to miss before the request even gets to the memory controller).
There are of course smarter connections, like AMD's HyperTransport or more recently Infinity Fabric (IF), that are async and can have many memory transactions in flight. But sadly the DIMMs are not connected to HT/IF. IBM's OMI is similar: a fast async serial interface, with an OMI connection to each RAM stick.
It looks like the X3D is no better than the 9900X for non-game single-threaded workloads like browsers, and it's much worse than the 12 or 16 core parts in terms of overall throughput, so for a non-gamer the plain X seems much better than the X3D.
https://hothardware.com/reviews/amd-ryzen-7-9800x3d-processo...
https://www.techpowerup.com/review/amd-ryzen-7-9800x3d/23.ht...
The rest of Zen5 was maybe a 5% bump on average, and Intel's new series actually regressed in performance compared to 14th gen.
Seems like the Zen 5 X3Ds will be the only good parts this time around.
The idea is the new platform will allow for better development in future, while improving efficiency fairly significantly.
I certainly hope the next generation is a massive bump for Intel, but we'll see if that's the case.
... and their 13th/14th generation processors had serious problems with overvoltage-induced failures - they clearly needed to step back and focus on reliability over performance.
Had also seen how he had editorialized some of my mailing list posts and I felt that I would be guilty of Gell-Mann amnesia if I carried on reading the site.
I think its topping the machine learning benchmarks has to do with having only 8 cores sharing the 96 MB of L3 cache, which works out to each core having 1 MB of L2 + 12 MB of L3. That's huge: it means each thread has more cache than, e.g., the entire NVIDIA 3090 (6 MB of L2 total), and this ends up taking full advantage of the extra silicon for the various AVX extensions.
But now cache is underneath.
I have never had one of the 3D V-Cache processors and am curious how it would improve the benchmarks for my multi-threaded data management system that does many operations against a set of 4K blocks of data.
I heard rumors that a 9950x3D version will be available in January. I am trying to figure out if I should wait.
Except for the fact that your computer runs more than one thread. Pity that this "single core" performance cannot be utilized at its maximum potential.
Apple is paying for exclusive access to TSMC's next node. That improves their final products, but doesn't make their architecture inherently better.
Geekbench scores cannot be used to compare laptop CPUs with desktop CPUs, because the tasks that are executed are too short and do not demonstrate the steady-state throughput of the CPUs. Desktop CPUs are much faster on multithreaded tasks, compared with laptop/tablet CPUs, than the GB results suggest.
The Apple CPUs have a much better instructions-per-clock-cycle ratio than any other CPUs, and now in M4 they also have a relatively high clock frequency, of at least 4.5 GHz. This allows them to win most single-threaded benchmarks.
However the performance in multi-threaded benchmarks has a very weak dependence on the CPU microarchitecture and it is determined mostly by the manufacturing process used for the CPU.
If we were able to compare Intel, AMD and Apple CPUs with the same number of cores and made with the same TSMC process, their multithreaded performance would be very close at a given power consumption.
The reason is that executing a given benchmark requires a number of logic transitions that is about the same for different microarchitectures, unless some of the design teams have been incompetent. An Apple CPU does more logic transitions per clock cycle, so in single thread it finishes the task faster.
However in multithreaded execution, where the power consumption of the CPU reaches the power limit, the number of logic transitions per second in the same manufacturing process is determined by the power consumption. Therefore the benchmark will be completed in approximately the same number of seconds when the power limits are the same, regardless of the differences in the single-threaded performance.
At equal power, an M4 will have a slightly better MT performance than an Intel or AMD CPU, due to the better manufacturing process, but the difference is too small to make it competitive with a desktop CPU.
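To make that model explicit, here is a tiny sketch (all numbers are made up for illustration; the model assumes a fixed amount of switching work per benchmark and a fixed energy per transition for a given process):

    # Toy version of the model above: at the power limit, time to finish depends
    # only on total switching work and energy per transition, not on how that
    # work is spread across cores or clock cycles. All numbers are made up.
    WORK_TRANSITIONS = 5e15          # transitions the benchmark needs (made up)
    ENERGY_PER_TRANSITION_J = 1e-14  # set by the process node (made up)
    POWER_LIMIT_W = 120

    def time_to_finish_s(work: float, e_per_t: float, power_w: float) -> float:
        transitions_per_second = power_w / e_per_t  # what the power budget allows
        return work / transitions_per_second

    # Core count, clock speed, and IPC never appear in the formula; a wide/slow
    # design and a narrow/fast design finish at the same time in this model.
    print(time_to_finish_s(WORK_TRANSITIONS, ENERGY_PER_TRANSITION_J, POWER_LIMIT_W))

The replies below push back on exactly the weak point of this model: energy per transition isn't actually fixed, since it depends on voltage and operating point, and chiplet interconnects burn power that doesn't show up as useful transitions.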
Bullshit. What you're talking about is the steady-state of the heatsink, not the steady state of the chip. Intel learned the hard way that a fast CPU core in a phone really does become a fast CPU core in a laptop or desktop when given a better cooling solution.
> However in multithreaded execution, where the power consumption of the CPU reaches the power limit, the number of logic transitions per second in the same manufacturing process is determined by the power consumption. Therefore the benchmark will be completed in approximately the same number of seconds when the power limits are the same, regardless of the differences in the single-threaded performance.
No, microarchitecture really does matter. And so does the macro architecture of AMD's desktop chips that burn a huge amount of power on an inefficient chip to chip interface.
The M4 Max you can pre-order gets around 26,000 multicore but is significantly more expensive than the 9950X ($569) or 9800X3D ($479). The M4 Max is a $1,200 premium over the M4 on the 14 inch MacBook Pro and a $1,100 premium over the M4 Pro on the 16 inch.
The M4 Max is only available in the MacBook Pro at present. The Mac Mini and iMac will only get the base M4. The Mac Studio is still based on the M2.
This is just a summary of performance and cost. Portability, efficiency, and compatibility will also weigh on everyone's choices.
♫ Cache rules everything around me ♫
The raw gaming performance increase is good, but its gaming efficiency seems to have taken a dip compared to the 7800X3D.
So AMD chose to decrease efficiency to get more performance this generation.
Source: https://www.techpowerup.com/review/amd-ryzen-7-9800x3d/23.ht...
For comparison of how limited last gen's X3D was with respect to power: Tom's Hardware has it at 71 W with an all-core AVX load, while my 7600X, with 2 fewer cores, consumes up to 130 W.
Hopefully Intel pulls something out again, but they look asleep at the wheel.
65W: https://browser.geekbench.com/v6/cpu/6126001
105W: https://browser.geekbench.com/v6/cpu/5821065
I actually haven't tested it with 170W (which is the default for the 7950x) for whatever reason, but the average 7950x score on geekbench is basically the same as my geekbench scores with lower than normal TDP.
https://browser.geekbench.com/processors/amd-ryzen-9-7950x
I wouldn't be surprised if the same is possible with the newer CPUs.
A nice added bonus is that my PC fans barely spin (not at audible speeds).
Watts/fps @ max fps makes for an interesting graph, but not a very clear comparison. It would be better to compare watts used when locked at a given fps, or fps available when locked at a given wattage. Or watt-hours to do a video encode (at max wattage, and at various watt limits).
Next rig and everything for the foreseeable future will be AMD. I've been a fanboy since the Athlon XP days - took a detour for a bit - but can't wait to get back.
Same. But already built a 3700X and then a 7700X.
I've got this feeling the wife is gonna upgrade her 3700X to a 7700X soon, meaning I'll get to build a 9000-series AMD!
What can explain those disappointing results but only on decompression?