Can the comparison also be done using the max precision unsigned integers instead of just float64?
I was just complaining about long double on Windows while working on an algorithm that could benefit from higher precision; this post is serendipitous.
Math functions are provided via libquadmath. Newer glibc also provides quad-precision functions with slightly different names, IIRC using an "f128" suffix rather than "q" like libquadmath.
Notably lacking support for ARM.
The article also gives an example from quantum physics. Some chaotic or highly numerically unstable problems just need quad precision to not blow up, primarily problems that tend to catastrophically amplify small errors (the trouble can come from the problem itself or from the algorithm used to solve it). Matrix inversion is another example, to the point that most algorithms try to do almost anything to avoid explicitly inverting a matrix. But sometimes you just have to.
Tricks like clever factorization (though many factorization algorithms have severe numerical issues of their own, e.g. some of the ways to compute QR or SVD), preconditioners, sparse and/or iterative algorithms like GMRES, randomized algorithms (my favorite), etc. are all workarounds you wouldn't need if there were a fast and stable way to exactly invert any arbitrary non-singular matrix. Well, you would have less need for them; there are other benefits to those methods, but improving numerical stability by avoiding inverses is one of the main ones.
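A minimal NumPy sketch of the usual "factor and solve, don't invert" advice (the Hilbert matrix is just a convenient ill-conditioned example; exact numbers will vary):

    import numpy as np

    n = 10
    # Hilbert matrix: a classic ill-conditioned test case
    A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
    x_true = np.ones(n)
    b = A @ x_true

    x_inv = np.linalg.inv(A) @ b     # explicit inverse, then multiply
    x_solve = np.linalg.solve(A, b)  # LU factorization + triangular solves

    print("cond(A):            ", np.linalg.cond(A))
    # solve() is backward stable, so its residual stays tiny;
    # the inverse-based route typically is not
    print("residual via inv:   ", np.linalg.norm(A @ x_inv - b))
    print("residual via solve: ", np.linalg.norm(A @ x_solve - b))
    # both forward errors still suffer from the bad conditioning
    print("error via inv:      ", np.linalg.norm(x_inv - x_true))
    print("error via solve:    ", np.linalg.norm(x_solve - x_true))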
Once you decide you need to go over 64 bits, it largely becomes a matter of convenience, and I can easily see two doubles being more convenient than either a quad or a 128-bit integer.
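The core of the two-doubles ("double-double") idea is small. A minimal Python sketch (Python floats are IEEE doubles), using the standard two_sum error-free transformation and a simplified addition:

    def two_sum(a, b):
        # returns (s, err) with s = fl(a + b) and a + b = s + err exactly
        s = a + b
        bb = s - a
        err = (a - (s - bb)) + (b - bb)
        return s, err

    def dd_add(x, y):
        # x and y are (hi, lo) pairs representing the value hi + lo
        s, e = two_sum(x[0], y[0])
        e += x[1] + y[1]
        return two_sum(s, e)          # renormalize

    acc = (0.0, 0.0)
    for _ in range(10):
        acc = dd_add(acc, (0.1, 0.0))
    print(acc)              # (1.0, 5.551115123125783e-17): the exact sum of ten fl(0.1)s
    print(sum([0.1] * 10))  # 0.9999999999999999 with plain doubles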
If, for example, you see bare summation rather than some sort of sorting by magnitude, expect bad results as the numbers get bigger.
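A tiny illustration with plain Python floats (the magnitudes are contrived to make the effect obvious):

    xs = [1.0] + [1e-17] * 1_000_000   # exact sum is 1.00000000001

    naive = 0.0
    for x in xs:                       # bare left-to-right summation
        naive += x                     # 1.0 + 1e-17 rounds back to 1.0 every time

    by_magnitude = 0.0
    for x in sorted(xs, key=abs):      # smallest magnitudes first
        by_magnitude += x              # the tiny terms accumulate before the 1.0 arrives

    print(naive)         # 1.0
    print(by_magnitude)  # ~1.00000000001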
I looked up my consumer card on a whim (playing with sprinkling some GPGPU into my code—cupy is actually really easy to use and ergonomic!), and apparently doubles are 1/32 as fast as singles. Yikes!
For example: simulating the three-body problem (three celestial bodies of equalish size orbiting each other). Very small differences get amplified, and any difference at all in how the math is implemented on different systems will result in divergent solutions.
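A toy version of this, assuming NumPy (the Pythagorean 3-4-5 starting configuration and the crude fixed-step leapfrog are just illustrative choices; the exact numbers depend on the step size, but the growth of the gap is the point):

    import numpy as np

    def simulate(perturb=0.0, dt=5e-4, steps=40_000, report_every=8_000):
        # Pythagorean three-body problem: masses 3, 4, 5 starting at rest at
        # the vertices of a 3-4-5 right triangle, G = 1. Generic three-body
        # motion like this is chaotic.
        m = np.array([3.0, 4.0, 5.0])
        pos = np.array([[1.0, 3.0], [-2.0, -1.0], [1.0, -1.0]])
        pos[0, 0] += perturb
        vel = np.zeros_like(pos)

        def accel(p):
            a = np.zeros_like(p)
            for i in range(3):
                for j in range(3):
                    if i != j:
                        d = p[j] - p[i]
                        a[i] += m[j] * d / np.linalg.norm(d) ** 3
            return a

        a = accel(pos)
        snapshots = []
        for step in range(steps):
            vel += 0.5 * dt * a        # leapfrog (kick-drift-kick)
            pos += dt * vel
            a = accel(pos)
            vel += 0.5 * dt * a
            if step % report_every == 0:
                snapshots.append(pos.copy())
        return snapshots

    # Two runs differing by 1e-12 in a single coordinate: the separation
    # typically grows by many orders of magnitude after close encounters.
    for p_a, p_b in zip(simulate(0.0), simulate(1e-12)):
        print("trajectory separation:", np.linalg.norm(p_a - p_b))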
Why introduce a new alias that might be 128 bits but might also be 80? IMO the world should focus on well-defined types (f8, f16, f32, f64, f80, f128), then maybe add aliases.
If you have written a linear system solver, you might prefer to express yourself in terms of single/double precision. The user is responsible for knowing if their matrix can be represented in single precision (whatever that means, it is their matrix and their hardware after all), how well conditioned it is, all that stuff. You might rather care that you are working in single precision, and that there exists a double precision (with, it is assumed, hardware support, because GPUs can’t hurt us if we pretend they don’t exist) to do iterative refinement in if you need to.
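A minimal NumPy sketch of that iterative-refinement loop (a real implementation would reuse a single single-precision LU factorization rather than calling solve repeatedly; the random test matrix is just for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 200))
    x_true = rng.standard_normal(200)
    b = A @ x_true

    A32, b32 = A.astype(np.float32), b.astype(np.float32)
    x = np.linalg.solve(A32, b32).astype(np.float64)     # cheap single-precision solve
    print("error before refinement:", np.linalg.norm(x - x_true))

    for _ in range(3):
        r = b - A @ x                                    # residual in double precision
        dx = np.linalg.solve(A32, r.astype(np.float32))  # correction via the cheap solver
        x = x + dx.astype(np.float64)

    print("error after refinement: ", np.linalg.norm(x - x_true))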
For some context, an electron "has structure" roughly on a length scale of about 10^-15 meters (details here unimportant). If you're representing a quantity in O(1) meters, then doing a handful of floating-point operations in double will likely accumulate error on the order of O(1) electrons. Incredible accuracy!
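Back-of-the-envelope version of that arithmetic (the 1e-15 m figure is the one quoted above):

    eps = 2.0 ** -52          # double-precision machine epsilon
    per_op = (eps / 2) * 1.0  # worst-case absolute rounding error on a ~1 m quantity
    print(per_op)             # ~1.1e-16 m
    print(per_op / 1e-15)     # ~0.11 "electrons" of error per operation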
I think the key here is to be mindful of units and make sure you're combining like with like and scaling things appropriately as you go. If you're doing things the wrong way, quad will punish you eventually, too.
An underappreciated tool for working more confidently with floats is interval arithmetic. Worth checking out if you actually (truly) need to do things robustly (most people don't).
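A toy sketch of the idea in Python: carry a [lo, hi] enclosure and round the bounds outward with math.nextafter after every operation. Real interval libraries use directed rounding modes and handle many more cases; this is just to show the flavour:

    import math

    def widen(lo, hi):
        # round the bounds outward by one ulp each
        return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)

    def iadd(a, b):
        return widen(a[0] + b[0], a[1] + b[1])

    def imul(a, b):
        p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
        return widen(min(p), max(p))

    x = widen(0.1, 0.1)   # an interval guaranteed to contain the real number 1/10
    acc = (0.0, 0.0)
    for _ in range(10):
        acc = iadd(acc, x)
    print(acc)            # a narrow interval guaranteed to contain 1.0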
Proposals like Posits aim to improve on this :)
A great thing in the Posit proposal is that they also consider a distinct accumulator type for things like sums and dot products. In my experience (I did a PhD on measuring floating-point error in large numerical simulations), those operations are the ones most likely to sneak up on you and need fixing. But most people don't realize that those accumulator types are actually easy to use with IEEE floats; they are entirely orthogonal to using Posits.
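For example, here is a sketch of such an accumulator for plain IEEE doubles (Neumaier's variant of Kahan summation; it is not the posit quire and not exactly rounded, but it shows the idea works fine with ordinary floats):

    class CompensatedSum:
        def __init__(self):
            self.s = 0.0   # running sum
            self.c = 0.0   # running compensation (lost low-order bits)

        def add(self, x):
            t = self.s + x
            if abs(self.s) >= abs(x):
                self.c += (self.s - t) + x   # low-order bits of x were lost
            else:
                self.c += (x - t) + self.s   # low-order bits of s were lost
            self.s = t

        def value(self):
            return self.s + self.c

    data = [1.0, 1e100, 1.0, -1e100]
    acc = CompensatedSum()
    for x in data:
        acc.add(x)
    print(acc.value())  # 2.0
    print(sum(data))    # 0.0 with a bare sum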
I was scratching my head for a bit there... It's also good to see that custom dtypes have come a long way. This is a great application of a custom dtype, and that wouldn't have been possible a few years back (well, okay, a "few" to me is probably like 8 years, but still).
In general, a well-placed compensated (or exact) sum/dot-product algorithm will solve most of your problems at a low cost. Nowadays most people are familiar with Kahan summation (and related tricks like sorting the numbers), but they don't know that it often helps less than 128-bit floats would, and that the state of the art for those algorithms has moved quite a bit since. These days you can ensure that your sum/BLAS operation produces a result within one ulp of the exact result!
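Python's standard library already ships one such exact-summation algorithm: math.fsum tracks exact partial sums and returns a correctly rounded result, so you get that behaviour without reaching for 128-bit floats:

    import math

    xs = [0.1] * 10
    print(sum(xs))        # 0.9999999999999999
    print(math.fsum(xs))  # 1.0, the correctly rounded sum of the inputs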
I highly recommend perusing the Handbook of Floating-Point Arithmetic (the Accurate Algorithm online page also seems fine [0]) to get an idea of what can be done. Implementations of these ideas can be found in most languages used for numerical computation (a quick Google search gave me options for C++/Python [1], Rust [2], etc.).
[0]: https://accurate-algorithms.readthedocs.io/en/latest/index.h...
It's interesting that 64 bits are often necessary (as opposed to 32) but 128 bits so rarely are.
> Significand: 113 bits
> Exponent: 15 bits
Seems like a worse tradeoff than Dec128 [1] (110 bits, 17 bits) unless it's way faster.
[1] https://www.intel.com/content/www/us/en/developer/articles/t...