• realtimechris 5 days ago |
    *Optimizing uint64_t Digit Counting: A New Method that Beats Lemire's by Up to 27%*

    In the quest to improve the performance of my high-speed JSON library, JSONIFIER, I recently stumbled upon a breakthrough in optimizing the calculation of digit counts for `uint64_t` values. While Lemire’s method of using the `lzcnt` instruction for determining the number of digits in a 32-bit unsigned integer has been widely regarded as efficient, I’ve developed a new method that achieves even faster results for 64-bit unsigned integers (i.e., `uint64_t`), with significant gains across different compilers and platforms.

    ### The Existing Method: Lemire’s Approach

    Lemire’s method, known for its efficiency, calculates the number of digits in a `uint32_t` by leveraging the `lzcnt` instruction, which finds the index of the most significant bit set to 1. This is combined with a static lookup table to map the result to the corresponding number of digits.

    Here’s the code for Lemire’s method:

    ```cpp JSONIFIER_INLINE int int_log2(uint32_t x) { return 31 - simd_internal::lzcnt(x | 1); }

    JSONIFIER_INLINE int fast_digit_count(uint32_t x) { static uint64_t table[] = { ... }; return (x + table[int_log2(x)]) >> 32; } ```

    While this approach works well for 32-bit integers, the need for a faster and more efficient solution for `uint64_t` led me to create an alternative method, which uses a more streamlined approach without the overhead of a lookup table.

    ### My New Method: RTC-64-Bit Digit Counting

    I’ve designed a new approach for 64-bit unsigned integers that leverages a similar logic but optimizes the process by storing precomputed digit counts for specific ranges and applying simple threshold checks. The result is faster execution with reduced computational overhead.

    Here's the code for the new method:

    ```cpp JSONIFIER_INLINE_VARIABLE uint8_t digitCounts[]{ ... };

    JSONIFIER_INLINE_VARIABLE uint64_t digitCountThresholds[]{ ... };

    JSONIFIER_INLINE uint64_t fastDigitCount(const uint64_t inputValue) { const uint64_t originalDigitCount{ digitCounts[simd_internal::lzcnt(inputValue)] }; return originalDigitCount + static_cast<uint64_t>(inputValue > digitCountThresholds[originalDigitCount]); } ```

    This method works by using a static array to hold the precomputed digit counts and another array for threshold values that determine the exact number of digits in a `uint64_t`. The key optimization lies in the efficient use of a bit manipulation technique and direct threshold checking to avoid unnecessary computations.

    ### [Benchmark Results](https://github.com/RealTimeChris/BenchmarkSuite/blob/digit-c...)

    I ran performance benchmarks comparing my new RTC-64-bit method with Lemire’s approach and the traditional `log10` method across various platforms and compilers. The results were consistently impressive:

    #### GCC/Ubuntu: - *RTC-64-bit* outperforms *Lemire-32-bit* by *27.33%*. - *Lemire-32-bit* beats *Log10-32-bit* by a massive *814.16%*.

    #### Clang/Ubuntu: - *RTC-64-bit* outperforms *Lemire-32-bit* by *143.34%*. - *Lemire-32-bit* beats *Log10-32-bit* by *522.01%*.

    #### MSVC/Windows: - *RTC-64-bit* is *12.50%* faster than *Lemire-32-bit*. - *Lemire-32-bit* beats *Log10-32-bit* by *515.90%*.

    #### Clang/MacOS: - *RTC-64-bit* is *25.37%* faster than *Lemire-32-bit*. - *Lemire-32-bit* beats *Log10-32-bit* by *343.97%*.

    ### Key Takeaways

    The RTC-64-bit method not only delivers improved performance on modern hardware but also significantly reduces the overhead of traditional methods by eliminating the need for a large lookup table. This is especially beneficial for high-performance applications where every cycle counts, such as in JSON serialization and parsing, where speed is critical.

    • nick__m 3 days ago |
      if you prefix your code by two space on every lines it will be more readable, ex:

        JSONIFIER_INLINE int fast_digit_count(uint32_t x)
        {
          static uint64_t table[] = { ... }; 
          return (x + table[int_log2(x)]) >> 32; 
        }
    • orra 3 days ago |
      Very nice work. But could you please explain why counting digits quickly (or even slowly) is useful for a JSON serializer? This is lower level than I'm used to working. Is this so you can alloc the right amount of memory, or something else?
      • ComputerGuru 3 days ago |
        Yes. And for string formatting in general such as sprintf to calculate buffer size. Also for proposes of calculating length for string padding and alignment.
      • o11c 3 days ago |
        Honestly? Most of the time you can use an approximation. At worst you allocate a single extra byte unless your integers are hundreds of bits long (I worked this out once but I forget where I left the details). Just multiply by the ratio of logs or something like that.
        • Nevermark 3 days ago |
          Nice. The best kind of optimization: Go lazy, to go fast.

          I often go slow, to go fast. But lazy is the supremum.

        • eapriv 2 days ago |
          Isn’t “the ratio of logs” going to be slower?
          • o11c 2 days ago |
            I mean the constant logs of the bases; you get to hard-code this in the algorithm.

            One interesting coincidence is that log₁₀(2) ~= log₂(10) - 3, thus the 83 ~= 78 below.

            If you use an 8-bit shift, depending on which direction you're going, you multiply the input length by 851 (which is 3*256 + 83) or 78, then shift by 8. This starts being wasteful at 23 bits, but is pretty cheap on 8-bit CPUs.

            For a more accurate approximation (slackening at 289 bits), 217706 (3<<16 + 21098) or 19729, then >> 16

        • exyi 2 days ago |
          I'd assume that the serializer is writing directly into the output buffer, so you'd have to shift everything left by one character if you overallocate. With the added checks for this, it might be faster to compute it precisely the first time.
      • vbezhenar 2 days ago |
        Because JSON is not well suitable as a fast machine exchange format and people trying to make it work anyway.

        If you ask me, JSON serializers should be simple and slow, and if you need speed, change wire format for something intrinsically fast.

        I understand that this is not the approach that people will take, most developers prefer simple interface with complex implementation over complex interface with simple implementation. But my opinion is that implementation simplicity is what matters in the end.

        • touisteur 2 days ago |
          Another way of seeing this is: whatever way people are using JSON, over a fleet of all the machines running the parser (or serializer), reducing the execution time (a good enough proxy to energy consumption) on the whole fleet is worthwile. Especially if it has zero impact on the code calling this parser/serializer (so, no second-order energy spent because of the change breaking some API or contract).
      • GuB-42 2 days ago |
        I think it is the case: first allocate the space, then write.

        However, I am not sure how significant the gain is here. Actually printing the number requires a division every digit, maybe 2 or 3, which I believe is much slower than even naive digit counting, and you will have to do it eventually.

        Maybe it is worth it if there is parallelism involved: write the JSON with blank spaces for the numbers, then have another thread write the numbers. Writing the numbers is computationally expensive because of the divisions, so if you have a lot of large numbers, it may be worth dedicating some cores to this task.

      • realtimechris 2 days ago |
        So we can properly dispatch the correct length-based function with minimal branching to detect which length to serialize for: https://github.com/RealTimeChris/Jsonifier/blob/dev/Include/...
    • mabster 3 days ago |
      FYI, Hackers Delight uses a slight variation of Lemire's. He multiplies by 19/64 (so the divisor is a shift) and has the highest integer at the end of his table (i.e. 2^64 - 1). He also uses a shift instead of a conditional, so his code is branchless.

      Generally I prefer this kind of approach. But, I suspect your digit counting is a very hot path so your tables will generally be in cache. So your approach will likely win anyway.

  • dzaima 3 days ago |
    A latency improvement would be to have digitCountThresholds be indexed by the lzcnt, instead of the result of the other LUT. Increases the size of that lookup table from 160B to 512B though. Funnily enough I've had this approach in a local copy of Ryu when I was working on cutting it down for my purposes. Unfortunately whatever ended up public had cut this out too.

    edit: some chat messages of me at [0]. Some previous unrelated discussion found at [1].

    [0]: https://app.element.io/#/room/#bqn:matrix.org/$KKIK86x0tygAf... (arbitrarily capped at 18 digits because Ryu didn't need more)

    [1]: https://github.com/ulfjack/ryu/issues/34

    • adgjlsfhk1 2 days ago |
      you can cut back the table size at the cost of 1 cycle latency by shifting the lzcnt right by 3. this works since 2^3<10
      • dzaima 2 days ago |
        That doesn't work - the numbers change every 3 or 4 entries, i.e. ~log2(10), not 10. (and the precise point of change is important too)
        • adgjlsfhk1 2 days ago |
          Good point. The other option is to use (1233*(64-lzcnt(x)))>>12 to bypass the first lookup entirely (not sure if this is faster or not), which works since 1233/2^12 is close enough to log_2(10).
          • dzaima 2 days ago |
            Multiplication typically takes 3 cycles, so a total of 1+3+1=5 cycles of latency. A load from L1 takes 4-5 cycles on non-ancient x86 and Apple M1. So roughly on-par, perhaps a bit faster, at the cost of eating some ILP in the odd case that there's not still plenty left (also better if not already in-cache, though only significantly so if using that instead of the lzcnt directly as the threshold table index).
  • 0xcoffee 3 days ago |
    C# Version:

        private static uint FastDigitCount(ulong value)
        {
            ReadOnlySpan<byte> digitCounts = [19, 19, 19, 19, 18, 18, 18, 17, 17, 17, 16, 16, 16, 16, 15, 15, 15, 14, 14, 14, 13, 13, 13, 13, 12, 12, 12, 11, 11, 11, 10, 10, 10, 10, 9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1];
            ReadOnlySpan<ulong> digitCountThresholds = [0, 9, 99, 999, 9999, 99999, 999999, 9999999, 99999999, 999999999, 9999999999, 99999999999, 999999999999, 9999999999999, 99999999999999, 999999999999999, 9999999999999999, 99999999999999999, 999999999999999999, 9999999999999999999];
        
            var leadingZeros = BitOperations.LeadingZeroCount(value);
            var originalDigitCount = digitCounts[leadingZeros];
            return originalDigitCount + (value > digitCountThresholds[originalDigitCount] ? 1u : 0u);
        }
    
    Benchmark: https://stackoverflow.com/a/79337820/4503491
  • hairtuq 2 days ago |
    Here is a cute variant that doesn't need lzcnt nor tables, but only works up to 99,999:

        (((x + 393206) & (x + 524188)) ^ ((x + 916504) & (x + 514288))) >> 17
    
    This is for integer log 10, but could be adapted for number of digits. It needs a wrapper for 64 bit to invoke it multiple times, but most numbers in a JSON are small, so it might even be competitive; it needs only 4 cycles with enough instruction level parallelism.

    I gathered this idea from the output of a superoptimizer, it was fun to figure out how it works. For spoilers, see [1].

    [1] https://github.com/rust-lang/rust/blob/master/library/core/s...

    • Aardwolf 2 days ago |
      > that doesn't need lzcnt

      Is it a big advantage to not need it, or can we safely assume CPU's have fast instructions for this today?

      • touisteur 2 days ago |
        Sometimes you have a scalar instruction but not a vectorized one, or it doesn't match the 'lanes' you want to operate (ISTR weird holes in AVX where I could find a specific instruction for 8, 32, 64-bits lanes but not for 16). Always good to have an escape hatch there, especially a highly pipelined (or pipeline-able) one.
    • Veedrac 2 days ago |
      It's not obvious to me that this is worth much over

          (x > 9) + (x > 99) + (x > 999) + (x > 9999)
      
      (x > CONST) does tend to need a pair of instructions to get the status flag into a register, but those can also fuse with the adds. Critical path latency is cmp-setae-sbb-add, or also four cycles.
  • billpg 2 days ago |
    If you were to ask me to write some code to convert an int64 into ascii decimals, I'd start with a 20 character buffer (because int64 will never need more), loop to populate the buffer with ascii decimal digits from the lowest significant end, then once the loop has finished and we know how many digits, copy the buffer into its place.

    I'd love to told that this method in the article is actually significantly more efficient than my possibly naïve approach.

    • matheusmoreira 2 days ago |
      > I'd start with a 20 character buffer (because int64 will never need more)

      Just in case anyone is wondering, this is because the maximum number of decimal digits for a given number of bits is given by:

        digits(bits) = ceil(bits * log10(2))
      
      This can be generalized to any number of bits.

        digits(32)   = 10
        digits(64)   = 20
        digits(128)  = 39
        digits(256)  = 78
        digits(512)  = 155
        digits(1024) = 309
  • Cold_Miserable 2 days ago |
    mov eax,64 lzcnt r8,rcx sub eax,r8d imul eax,1233 shr eax,12

    Accurate to within 1.