There are only "winners" in the sense that people will be able to more easily see why their never-tuned system is so slow. On the other hand, you're punishing every perf-critical use case with unnecessary overhead.
I believe that if you have a slow system, it's up to you to profile and optimize it, even if that means recompiling some software with different flags to enable profiling. It's not upstream's job to make this easier for you if it means punishing workloads where teams have diligently profiled and optimized over the years so that there is, as the author says, no low-hanging fruit left to find.
I get that some use cases may be better without frame pointers. A well-resourced team can always recompile the world, whichever the default is. It’s just that my experience is that most software is not already perfectly tuned and I’d much rather the default be more easily observable.
I wasn't exaggerating about recompiling the world, though. Even if we say I'm only interested in profiling my application, a single library compiled without frame pointers makes useless any samples where code in that library was at the top of the stack. I've seen that be libc, openssl, some random Node module or JNI thing, etc. You can't just throw out those samples because they might still be your application's problem. For me in those situations, I would have needed to recompile most of the packages we got from both the OS distro and the supplemental package repo.
My experience is in performance tuning on the other side you mention. Cross-application, cross-library, whole-system, daemons, etc. Basically, "the whole OS as it's shipped to users".
For my case, I need the whole system set up correctly before it even starts to be useful. For your case, you only need the specific library or application compiled correctly. The rest of the system is negligible and probably not even used. Who would optimize SIMD routines next to function calls anyway?
The overhead is about 10% of samples. But at least you can unwind on systems without frame pointers. Personally, I'll take the statistical anomalies of frame pointers, which still let you know which PID/TID is your cost center even if you don't get perfect unwinds. Everyone seems motivated to move toward SFrame going forward, which is good.
https://blogs.gnome.org/chergert/2024/11/03/profiling-w-o-fr...
In any event I don't understand why frame pointers need to be on by default instead of developers enabling them where needed.
Having Kitten include frame pointers by default seems reasonable enough, since Kitten is a devel system.
Anyway, soon we'll have SFrame support in the userspace tools and the whole issue will go away.
But there are some 1% costs that are worth it.
* modify system services?
* run a compiler?
* add custom package repositories?
* change the default shell?
I believe the answer to all of the above is "no".
Visa makes billions per year off of nothing but collecting a mere 2%-3% tax on everything else.
The whole point of an analogy is to expose a blind spot by showing the same thing in some other context where it is recognized or perceived differently.
CPUs spend cycles on features (doing useful work). Enabling frame pointers skims off a percentage of those cycles. But it's the impact on useful work that matters, not how many cycles you lose; the cycles are just a means to an end. So x% of cycles is fundamentally incomparable to x% of money.
This does not explain why a distribution should have such a feature on by default. It only explains why Netflix wants it on some of their systems.
And they did.
The question, though, is why only Netflix should benefit from that. It takes a lot of effort to recompile an entire Linux distribution.
---
I think it comes down to numbers. What are most installed systems used for? Do more than 50% of installed systems need to be profiling all binaries, all the time, such that everything must already be built this way, without having to identify and prepare the relevant packages ahead of time?
If so, then it should be the default.
If it's a close call, then there should be 2 versions of the iso and repos.
As many developers and service operators as there are, even counting everyone on this page, both you and me, I still do not believe the profiling use case is the majority use case.
The way I am trying to judge "majority" is: Pick a binary at random from a distribution. Now imagine all running instances of that binary everywhere. How many of those instances need to be profiled? Is it really most of them?
So it's not just unsympathetic "F developers/services problems". I am one myself.
---
"people across the industry" is a meaningless and valueless term and is an empty argument.
I think a big enough fraction of potentially-useful crash reports come from those systems to make it a good default.
They have designed the instruction set in such a way that two distinct registers were necessary for fulfilling the roles of the stack pointer and of the frame pointer.
In better-designed instruction sets, for example in IBM POWER, a single register is enough to fulfill both roles, being simultaneously the stack pointer and the frame pointer.
Unfortunately, the Intel designers did not think about this problem at all; in 1978 they simply followed the example of the architectures popular at the time, e.g. the DEC VAX, which had made the same mistake of reserving two distinct registers for the roles of stack pointer and frame pointer.
In the architectures where a single register plays both roles, the stack pointer always points to a valid stack frame that is a part of a linked list of all stack frames. For this to work, there must be an atomic instruction for both creating a new stack frame (which consists in storing the old frame pointer in the right place of the new stack frame) and updating the stack pointer to point to the new stack frame. The Intel/AMD ISA does not have such an atomic instruction, and this is the reason why two registers are needed for creating a new stack frame in a safe way (safe means that the frame pointer always points to a valid stack frame and the stack pointer always points to the top of stack).
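To make that linked-list structure concrete, here's a minimal sketch (mine, not from the thread) of walking the frame-pointer chain on x86-64 with GCC or Clang, assuming everything on the stack was built with -fno-omit-frame-pointer; each frame stores the caller's saved %rbp at offset 0 and the return address at offset 8:

    /* Hedged sketch: walk the x86-64 frame-pointer chain (GCC/Clang),
     * assuming all code on the stack keeps frame pointers.
     * fp[0] holds the caller's saved %rbp, fp[1] the return address. */
    #include <stdio.h>
    #include <stdint.h>

    static void backtrace_fp(void)
    {
        uintptr_t *fp = (uintptr_t *)__builtin_frame_address(0);
        while (fp) {
            uintptr_t ret = fp[1];                      /* return address in the caller */
            uintptr_t *caller_fp = (uintptr_t *)fp[0];  /* caller's saved frame pointer */
            printf("fp=%p ret=%#lx\n", (void *)fp, (unsigned long)ret);
            if (caller_fp <= fp)                        /* stack grows down; bail on junk */
                break;
            fp = caller_fp;
        }
    }

    /* noinline keeps a few real frames on the stack for the demo */
    static void __attribute__((noinline)) leaf(void)   { backtrace_fp(); }
    static void __attribute__((noinline)) middle(void) { leaf(); }

    int main(void)
    {
        middle();
        return 0;
    }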
If you enable frame pointers, you need to recompile every library your executable depends on. Otherwise, the unwind will fail at the first function that's not part of your executable. Usually library function calls (like glibc) are at the top of the stack, so for a large portion of the samples in a typical profile, you won't get any stack unwind at all.
In many (most?) cases recompiling all those libraries is just infeasible for the application developers, which is why the distro would need to do it. Developers can still choose whether to include frame pointers in their own applications (and so they can still pick up those 1-2% performance gains in their own code). But they're stuck with frame pointers enabled on all the distro provided code.
So the choice developers get to make is more along the lines of: should they use a distro with FP or without. Which is definitely not ideal, but that's life.
FYI, if you happen to be running on an Intel CPU, --call-graph lbr uses some specialized hardware and often delivers a far superior result, with some notable failure modes. Really looking forward to AMD implementing a similar feature.
It's certainly true that there can be junk in --call-graph fp, too.
Wait, are we really that close to the maximum of what a compiler can optimize that we're getting barely 1% performance improvements per year with new versions?
There are some transformations that compilers are really bad at. Rearranging data structures, switching out algorithms for equivalent ones with better big-O complexity, generating & using lookup tables, bit-packing things, using caches, hash tables and bloom filters for time/memory trade offs, etc.
The spec doesn't prevent such optimizations, but current compilers aren't smart enough to find them.
This would just be an extension of that. If the code creates and uses a linked list, yet the list is 1M items long and being accessed entirely by index, branch to a different version of the code which uses an array, etc.
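As a purely hypothetical illustration of that kind of rewrite (the names below are made up), here is the difference a compiler would have to discover: summing n items through a linked list accessed by index costs roughly n^2/2 node hops, while the same loop over an array is a single cache-friendly pass.

    /* Hypothetical illustration: the same "sum the first n items" loop,
     * written against a linked list accessed by index (O(n) per lookup,
     * O(n^2) total) versus a flat array (O(1) per lookup, O(n) total).
     * Today's compilers will not switch the data structure for you. */
    #include <stddef.h>

    struct node { long value; struct node *next; };

    /* O(n) per call: walks from the head every time. */
    static long list_get(const struct node *head, size_t index)
    {
        while (index--)
            head = head->next;
        return head->value;
    }

    static long sum_list_by_index(const struct node *head, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += list_get(head, i);   /* n lookups, each O(i): ~n^2/2 hops */
        return sum;
    }

    static long sum_array(const long *items, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += items[i];            /* one linear, prefetch-friendly pass */
        return sum;
    }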
However, for sure, if compiler optimizations disappeared, HW would pick up the slack in a few years.
Current compilers could do a lot better with vectorization, but it will often be limited by the data structure layout.
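To illustrate the layout point (again a hypothetical example, not from the thread): a compiler will happily vectorize a reduction over a struct-of-arrays, but an array-of-structs forces strided loads, and no current compiler will change the layout for you.

    /* Hypothetical sketch: the same reduction over an array-of-structs
     * (interleaved fields, strided loads) versus a struct-of-arrays
     * (contiguous field, vectorizes cleanly). */
    #include <stddef.h>

    struct particle_aos { float x, y, z, mass; };

    struct particles_soa { float *x, *y, *z, *mass; };

    static float total_mass_aos(const struct particle_aos *p, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++)
            sum += p[i].mass;   /* one float out of every 16 bytes */
        return sum;
    }

    static float total_mass_soa(const struct particles_soa *p, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++)
            sum += p->mass[i];  /* contiguous loads, full-width vectors */
        return sum;
    }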
9X% of users do not care about a <1% drop in performance. I suspect we get the same variability just by going from one kernel version to another. The impact from all the Intel mitigations that are now enabled by default is much worse.
However I do care about nice profiles and stack traces without having to jump through hoops.
Asking people to recompile an _entire_ distribution just to get sane defaults is wrong. Those who care about the last drop should build their custom systems as they see fit, and they probably already do.
So yes, it will be faster than the other alternatives to frame pointers, but it still won't be as fast as frame pointers.
Are you the majority?
Evaluate "majority" this way: For every/any random binary in a distro, out of all the currently running instances of that binary in the world at any given moment, how many of those need to be profiled?
There is no way the answer is "most of them".
You have a job where you profile things, and maybe even you profile almost everything you touch. Your whole world has a high quotient of profiling in it. So you want the whole system built for profiling by default. How convenient for you. But your whole world is not the whole world.
But it's not just you; there are, zomg, thousands, tens of thousands, maybe even hundreds of thousands of developers and ops admins the same as you.
Yes, and? Is even that most of the installed instances of any given executable? No way.
Or maybe yes. It's possible. Can you show that somehow? But I will guess no way and not even close.
You can't do that when step one is to reinstall another distro and reproduce your problem.
Additionally, the performance-sensitive code that could actually fall into that 1% range (hint: there isn't much of it) rarely uses the system libraries in a way that would cause this anyway. Its developers can compile that app with frame pointers disabled. And for the cases where they do use system libraries (qsort, bsearch, strlen, etc.), the frame pointer cost is negligible compared to the work being performed. Your margin of error is way larger than the theoretical overhead.
Spending 1% of all activity is only rational if you get more than 1% of all activity back from the times and places where it was actually used.
1%, when it's 1% of everything, is a stupendously colossal number that is absolutely crazy to treat as trivial.
Visibility at the “cost” of negligible impact is more important than raw performance. That’s it.
I’m a regular user of Linux with some performance sensitivity that does not go as far as “I _need_ that extra register!”. That’s what the majority of developers working on Linux are like. I think it’s up to _you_ to prove the contrary.
So, "yes". In fact "yes, duh?" Talk about head in sand...
> I see no point myself and I'm even in the field.
You don't see the point of readable stack traces?
This is an absurd way to evaluate it. All it takes is one savvy user to report a performance problem that developers are able to root-cause using stack traces from the user's system. Suppose they're able to make a 5% performance improvement to the program. Now every user's copy of that program is 5% faster because of the frame pointers on this one user's system.
At this point people usually ask: but couldn't developers have done that on their own systems with debug code? But the performance of debug code is not the same as the performance of shipping code. And not all problems manifest the same on all systems. This is why you need shipping code to be debuggable (or instrumentable or profileable or whatever you want to call it).
Most systems need to generate useful crash reports. Even end user systems. What kind of system doesn't need them? How else are developers supposed to reliably address user complaints?
Theoretically, there are alternative ways to generate stack traces without using frame pointers. The problem is, they're not nearly as ubiquitous and require more work to integrate into existing applications and workflows. That makes them useless in practice for a large number of cases.
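For what it's worth, one such alternative that already exists is glibc's backtrace()/backtrace_symbols() (a real API, in-process only), which, as I understand it, unwinds through libgcc using .eh_frame data rather than frame pointers; a minimal sketch:

    /* Sketch of an existing frame-pointer-free option: glibc's backtrace().
     * It only works from inside the process, and symbol names in the main
     * executable need -rdynamic (or external tooling), which is part of the
     * "more work to integrate" point above. */
    #include <execinfo.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void print_trace(void)
    {
        void *addrs[64];
        int n = backtrace(addrs, 64);           /* fill addrs with return addresses */
        char **syms = backtrace_symbols(addrs, n);
        if (!syms)
            return;
        for (int i = 0; i < n; i++)
            printf("%s\n", syms[i]);
        free(syms);
    }

    int main(void)
    {
        print_trace();
        return 0;
    }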
Except Python got opted out of the frame pointer change due to benchmarks showing slowdowns of up to 10%. The discussion around that had the great idea of just adding a pragma to flat-out override the build setting. So in the end, that "1%" reduction claim only holds if everything even remotely affected silently ignores the flag.
They only opted out for 3.11 which did not yet have the perf-integration fixes anyway. 3.12 uses frame-pointers just fine.
The same could be said about any accessibility issue or minority language translations
There's basically no downside to fixing accessibility issues or adding new language translations other than the work involved in doing so. (And yes, maintaining translations over time is hard, but most projects let them lag during development, so they don't directly hold anything back.) There is a rather glaring downside to this performance optimization, whose upside is sometimes entirely within run-to-run variance and can be blown away by almost any other performance tweak. It's clear the optimization has some upsides, but an extra register and saving some trivial loads/stores just isn't as big of a deal on modern processors that are loaded to the gills with huge caches and deep pipelines.
I guess I don't care that much about fomit-frame-pointer in the grand scheme of things, but I think enabling it in distributions was ultimately a mistake. If some software packages benefited enough from it, it could've just been done only for those packages. Doing it across the system is questionable at best...
Some of the statements in the post seem odd to me though.
- 5% of system-wide cycles spent in function prologues/epilogues? That is wild, it can't be right.
- Is using the whole 8 bytes right for the estimate? The push of the old frame pointer is the first instruction in the prologue and it's literally 1 byte. The epilogue is symmetrical.
- Even if we're in the prologue, the frame just looks like a leaf call: we can still resolve the instruction pointer to the function, and we can read the return address to find the parent, so what information is lost?
When it comes to future alternatives, while frame pointers have their own problems, I think that there are still a few open questions:
- Shadow stacks are cool but aren't they limited to a fixed number of entries? What if you have a deeper stack?
- Is the memory overhead of lookup tables for very large programs acceptable?
Currently available hardware, yes. But I think some of the future Intel stuff was going to allow for much larger depths.
> Is the memory overhead of lookup tables for very large programs acceptable?
I don't think SFrame is as "dense" as DWARF as a format so you trade a bit of memory size for a much faster unwind experience. But you are definitely right that this adds memory pressure that could otherwise be ignored.
Especially if the anomalies are what they sound like, just account for them statistically. You get a PID for cost accounting in the perf_event frame anyway.
I believe it's because of the landing pad for Control Flow Integrity which basically all functions now need. Grabbing main() from a random program on Fedora (which uses frame pointers):
0000000000007000 <main>:
7000: f3 0f 1e fa endbr64 ; landing pad
7004: 55 push %rbp ; set up frame pointer
7005: 48 89 e5 mov %rsp,%rbp
It's not much of an issue in practice as the stack trace will still be nearly correct, enough for you to identify the problematic area of the code.

> - Shadow stacks are cool but aren't they limited to a fixed number of entries? What if you have a deeper stack?
Yes, shadow stacks are limited to 32 entries on the most recent Intel CPUs (and as few as 4 entries on very old ones). However, they are basically cost-free, so that's a big advantage.
I think SFrame is a sensible middle ground here. It's saner than DWARF and has a long history of use in the kernel so we know it will work.
TBH I wouldn't be surprised on x86. There are so many registers to be pushed and popped due to the ABI, so every time I profile stuff I get depressed… Aarch64 seems to be better, the prologues are generally shorter when I look at those. (There's probably a reason why Intel APX introduces push2/pop2 instructions.)
Temporary solutions have a way of becoming permanent. I was against the recent frame pointer enablement on the grounds of moral hazard. I still think it would have been better to force the ecosystem to get its act together first.
Another factor nobody is talking about is JITed and interpreted languages. Whatever the long-term solution might be, it should enable stack traces that interleave accurate source-level frame information from native and managed code. The existing perf /tmp file hack is inadequate in many ways, including security, performance, and compatibility with multiple language runtimes coexisting in a single process.
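For context, the hack being referred to is, as far as I know, the /tmp/perf-<pid>.map convention: the runtime appends one "start size symbolname" line (hex fields) per JITed function so perf can symbolize those addresses. A rough sketch, with placeholder address and size:

    /* Sketch of the /tmp/perf-<pid>.map convention: one line per JITed
     * function, "<start addr> <size> <symbol>", addr and size in hex.
     * The address and size below are placeholders for illustration. */
    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>

    static void perf_map_add(FILE *f, uintptr_t start, size_t size, const char *name)
    {
        fprintf(f, "%lx %zx %s\n", (unsigned long)start, size, name);
    }

    int main(void)
    {
        char path[64];
        snprintf(path, sizeof path, "/tmp/perf-%d.map", (int)getpid());
        FILE *f = fopen(path, "a");
        if (!f)
            return 1;
        /* Placeholder values standing in for a freshly JIT-compiled function. */
        perf_map_add(f, (uintptr_t)0x7f2a40001000u, 0x80, "jit:my_hot_loop");
        fclose(f);
        return 0;
    }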
But, at least from the GNOME side of things, we've been complaining about it for roughly 15 years and kept getting push-back in the form of "we'll make something better".
Now that we have frame-pointers enabled in Fedora, Ubuntu, Arch, etc we're starting to see movement on realistic alternatives. So in many ways, I think the moral hazard was waiting until 2023 to enable them.