In the quest to improve the performance of my high-speed JSON library, JSONIFIER, I recently stumbled upon a breakthrough in optimizing the calculation of digit counts for `uint64_t` values. While Lemire’s method of using the `lzcnt` instruction for determining the number of digits in a 32-bit unsigned integer has been widely regarded as efficient, I’ve developed a new method that achieves even faster results for 64-bit unsigned integers (i.e., `uint64_t`), with significant gains across different compilers and platforms.
### The Existing Method: Lemire’s Approach
Lemire’s method, known for its efficiency, calculates the number of digits in a `uint32_t` by leveraging the `lzcnt` instruction, which counts the leading zero bits; subtracting that count from 31 gives the index of the most significant set bit. This is combined with a static lookup table that maps the result to the corresponding number of digits.
Here’s the code for Lemire’s method:
```cpp
JSONIFIER_INLINE int int_log2(uint32_t x) {
    return 31 - simd_internal::lzcnt(x | 1);
}

JSONIFIER_INLINE int fast_digit_count(uint32_t x) {
    static uint64_t table[] = { ... };
    return (x + table[int_log2(x)]) >> 32;
}
```
While this approach works well for 32-bit integers, the need for a faster, more efficient solution for `uint64_t` led me to create an alternative method, which streamlines the computation: instead of a single table of 64-bit correction values, it uses two small tables and one threshold comparison.
### My New Method: RTC-64-Bit Digit Counting
I’ve designed a new approach for 64-bit unsigned integers that applies similar logic but stores precomputed digit counts for specific leading-zero ranges and applies a single threshold check. The result is faster execution with reduced computational overhead.
Here's the code for the new method:
```cpp
JSONIFIER_INLINE_VARIABLE uint8_t digitCounts[]{ ... };
JSONIFIER_INLINE_VARIABLE uint64_t digitCountThresholds[]{ ... };

JSONIFIER_INLINE uint64_t fastDigitCount(const uint64_t inputValue) {
    const uint64_t originalDigitCount{ digitCounts[simd_internal::lzcnt(inputValue)] };
    return originalDigitCount + static_cast<uint64_t>(inputValue > digitCountThresholds[originalDigitCount]);
}
```
This method uses one static array of precomputed digit counts, indexed by the leading-zero count of the input, and a second array of thresholds holding the largest value representable with each digit count. The `lzcnt` result selects the minimum digit count possible for the input’s bit length, and a single comparison against the corresponding threshold adds one when the value spills into the next digit, avoiding any further computation.
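The arrays are elided above; as a rough illustration of the mechanics, here is a self-contained sketch (not JSONIFIER’s actual code) that substitutes C++20’s `std::countl_zero` for `simd_internal::lzcnt` and borrows the table values from the C# port quoted later in the thread:

```cpp
#include <bit>
#include <cstdint>

// digitCounts[lz] = minimum digit count for a value with `lz` leading zero bits;
// digitCountThresholds[d] = largest d-digit value (10^d - 1), so one comparison
// decides whether the input actually needs d + 1 digits.
inline uint64_t fastDigitCountSketch(uint64_t value) {
    static constexpr uint8_t digitCounts[65] = {
        19, 19, 19, 19, 18, 18, 18, 17, 17, 17, 16, 16, 16, 16, 15, 15, 15,
        14, 14, 14, 13, 13, 13, 13, 12, 12, 12, 11, 11, 11, 10, 10, 10, 10,
         9,  9,  9,  8,  8,  8,  7,  7,  7,  7,  6,  6,  6,  5,  5,  5,  4,
         4,  4,  4,  3,  3,  3,  2,  2,  2,  1,  1,  1,  1,  1 };
    static constexpr uint64_t digitCountThresholds[20] = {
        0ull, 9ull, 99ull, 999ull, 9999ull, 99999ull, 999999ull, 9999999ull,
        99999999ull, 999999999ull, 9999999999ull, 99999999999ull,
        999999999999ull, 9999999999999ull, 99999999999999ull,
        999999999999999ull, 9999999999999999ull, 99999999999999999ull,
        999999999999999999ull, 9999999999999999999ull };
    const uint64_t digits = digitCounts[std::countl_zero(value)];
    return digits + (value > digitCountThresholds[digits]);
}
```

For example, 1234 has 53 leading zero bits, `digitCounts[53]` is 4, and 1234 is not greater than `digitCountThresholds[4]` (9999), so the count stays at 4.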
### [Benchmark Results](https://github.com/RealTimeChris/BenchmarkSuite/blob/digit-c...)
I ran performance benchmarks comparing my new RTC-64-bit method with Lemire’s approach and the traditional `log10` method across various platforms and compilers. The results were consistently impressive:
#### GCC/Ubuntu
- *RTC-64-bit* outperforms *Lemire-32-bit* by *27.33%*.
- *Lemire-32-bit* beats *Log10-32-bit* by a massive *814.16%*.

#### Clang/Ubuntu
- *RTC-64-bit* outperforms *Lemire-32-bit* by *143.34%*.
- *Lemire-32-bit* beats *Log10-32-bit* by *522.01%*.

#### MSVC/Windows
- *RTC-64-bit* is *12.50%* faster than *Lemire-32-bit*.
- *Lemire-32-bit* beats *Log10-32-bit* by *515.90%*.

#### Clang/MacOS
- *RTC-64-bit* is *25.37%* faster than *Lemire-32-bit*.
- *Lemire-32-bit* beats *Log10-32-bit* by *343.97%*.
### Key Takeaways
The RTC-64-bit method not only delivers improved performance on modern hardware, it also sheds much of the overhead of traditional methods by replacing a wide 64-bit lookup table with two compact arrays and a single comparison. This is especially beneficial for high-performance applications where every cycle counts, such as JSON serialization and parsing.
I often go slow, to go fast. But lazy is the supremum.
One interesting coincidence is that log₁₀(2) ~= log₂(10) - 3, thus the 83 ~= 78 below.
If you use an 8-bit shift, depending on which direction you're going, you multiply the input length by 851 (which is 3*256 + 83) or 78, then shift by 8. This starts being wasteful at 23 bits, but is pretty cheap on 8-bit CPUs.
For a more accurate approximation (it only starts slackening at 289 bits), multiply by 217706 (3<<16 + 21098) or 19729, then shift by 16.
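As a sketch of how these constants would be used (the helper names here are assumptions, and the rounding is only approximate, as noted above):

```cpp
#include <cstdint>

// 8-bit-shift variants: 78 / 256 ~= log10(2), 851 / 256 ~= log2(10) (851 = 3*256 + 83).
constexpr uint32_t approxBitsToDigits8(uint32_t bits)   { return (bits * 78u) >> 8; }
constexpr uint32_t approxDigitsToBits8(uint32_t digits) { return (digits * 851u) >> 8; }

// 16-bit-shift variants: 19729 / 65536 ~= log10(2), 217706 / 65536 ~= log2(10)
// (217706 = 3 * 65536 + 21098).
constexpr uint64_t approxBitsToDigits16(uint64_t bits)   { return (bits * 19729u) >> 16; }
constexpr uint64_t approxDigitsToBits16(uint64_t digits) { return (digits * 217706u) >> 16; }
```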
If you ask me, JSON serializers should be simple and slow, and if you need speed, change the wire format to something intrinsically fast.
I understand that this is not the approach people will take; most developers prefer a simple interface with a complex implementation over a complex interface with a simple implementation. But my opinion is that implementation simplicity is what matters in the end.
However, I am not sure how significant the gain is here. Actually printing the number requires a division for every digit, or maybe every 2 or 3 digits, which I believe is much slower than even naive digit counting, and you will have to do it eventually.
Maybe it is worth it if there is parallelism involved: write the JSON with blank spaces for the numbers, then have another thread write the numbers. Writing the numbers is computationally expensive because of the divisions, so if you have a lot of large numbers, it may be worth dedicating some cores to this task.
Generally I prefer this kind of approach. But I suspect your digit counting is a very hot path, so your tables will generally be in cache. So your approach will likely win anyway.
edit: some chat messages of mine at [0]. Some previous, unrelated discussion can be found at [1].
[0]: https://app.element.io/#/room/#bqn:matrix.org/$KKIK86x0tygAf... (arbitrarily capped at 18 digits because Ryu didn't need more)
```csharp
private static uint FastDigitCount(ulong value)
{
    ReadOnlySpan<byte> digitCounts = [19, 19, 19, 19, 18, 18, 18, 17, 17, 17, 16, 16, 16, 16, 15, 15, 15, 14, 14, 14, 13, 13, 13, 13, 12, 12, 12, 11, 11, 11, 10, 10, 10, 10, 9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1];
    ReadOnlySpan<ulong> digitCountThresholds = [0, 9, 99, 999, 9999, 99999, 999999, 9999999, 99999999, 999999999, 9999999999, 99999999999, 999999999999, 9999999999999, 99999999999999, 999999999999999, 9999999999999999, 99999999999999999, 999999999999999999, 9999999999999999999];
    var leadingZeros = BitOperations.LeadingZeroCount(value);
    var originalDigitCount = digitCounts[leadingZeros];
    return originalDigitCount + (value > digitCountThresholds[originalDigitCount] ? 1u : 0u);
}
```
Benchmark: https://stackoverflow.com/a/79337820/4503491

```cpp
(((x + 393206) & (x + 524188)) ^ ((x + 916504) & (x + 514288))) >> 17
```
This is for integer log 10, but it could be adapted for counting digits. It needs a wrapper for 64-bit values to invoke it multiple times, but most numbers in JSON are small, so it might even be competitive; it needs only 4 cycles with enough instruction-level parallelism. I gathered this idea from the output of a superoptimizer, and it was fun to figure out how it works. For spoilers, see [1].
[1] https://github.com/rust-lang/rust/blob/master/library/core/s...
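Spelled out as a sketch (the valid range of x up to 99999 and the +1 for the digit count are assumptions here, checked only at the decade boundaries; the linked Rust source has the real derivation):

```cpp
#include <cstdint>

// Branchless floor(log10(x)) for small x (assumed valid for x <= 99999; 0 maps to 0).
constexpr uint32_t log10Small(uint32_t x) {
    return (((x + 393206u) & (x + 524188u)) ^ ((x + 916504u) & (x + 514288u))) >> 17;
}

// A small-value digit count would then be log10Small(x) + 1.
```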
Is it a big advantage to not need it, or can we safely assume CPUs have fast instructions for this today?
```cpp
(x > 9) + (x > 99) + (x > 999) + (x > 9999)
```
(x > CONST) does tend to need a pair of instructions to get the status flag into a register, but those can also fuse with the adds. The critical-path latency is cmp-setae-sbb-add, likewise four cycles. I'd love to be told that the method in the article is actually significantly more efficient than my possibly naïve approach.
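For reference, here is that chained-comparison idea extended to the full `uint32_t` range (a sketch assuming the straightforward extension of the snippet above):

```cpp
#include <cstdint>

// Digit count via chained comparisons: each true comparison contributes one.
constexpr uint32_t digitCountNaive(uint32_t x) {
    return 1u
        + (x > 9u) + (x > 99u) + (x > 999u) + (x > 9999u)
        + (x > 99999u) + (x > 999999u) + (x > 9999999u)
        + (x > 99999999u) + (x > 999999999u);
}
```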
Just in case anyone is wondering, this is because the maximum number of decimal digits for a given number of bits is given by:
digits(bits) = ceil(bits * log10(2))
This can be generalized to any number of bits.

digits(32) = 10
digits(64) = 20
digits(128) = 39
digits(256) = 78
digits(512) = 155
digits(1024) = 309
Accurate to within 1.
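For anyone who wants to reproduce the table, a one-liner that matches the formula above (double precision is more than enough at these sizes):

```cpp
#include <cmath>

// Maximum decimal digits needed for an unsigned value of `bits` bits.
inline unsigned maxDecimalDigits(unsigned bits) {
    return static_cast<unsigned>(std::ceil(bits * std::log10(2.0)));
}
// maxDecimalDigits(64) == 20, maxDecimalDigits(128) == 39, maxDecimalDigits(1024) == 309
```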