Fastest Branchless Binary Search
11 points by xlinux 3 days ago | 6 comments
  • pkhuong 3 days ago |

        auto length = last - first;
        while (length > 0) {
           auto half = length / 2;
           if (comp(first[half], value)) {
              // length - half equals half + length % 2
              first += length - half;
           }
           length = half;
        }
        return first;
    
    can be restructured so that the conditional update is `first += half` and the unconditional one is `length -= half` (see, e.g., Listing 2 in https://arxiv.org/abs/1509.05053). Doing it this way requires one final comparison when `length > 0` on entry, because the while condition becomes `length > 1`, but it moves the subtraction off the critical path.
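
    Roughly, that restructuring looks like this (a minimal sketch in the same style as the snippet above, not copied verbatim from the paper):

        auto length = last - first;
        while (length > 1) {
           auto half = length / 2;
           if (comp(first[half], value)) {
              first += half;
           }
           length -= half;
        }
        // the one extra comparison, guarded so an empty range is safe
        if (length > 0 && comp(first[0], value)) {
           ++first;
        }
        return first;
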
    • ryao 3 days ago |

        _Pragma("GCC unroll 9")
        while (nelems > 1)  {
          auto half = nelems / 2;
          nelems -= half;
          first += (comp(first[half - 1], value)) * half;
        }
        return first;
      
      As long as we are sharing improvements, here is mine. Microbenchmarking numerous variations showed this one performed best. If your arrays are small enough that the unrolled iterations cover the whole search, you don’t even loop back; in that case you have exactly one branch that is executed exactly once. This is the closest to branch-free that a binary search can be.

      I devised this a couple years ago for use in ZFS’ btree code:

      https://github.com/openzfs/zfs/commit/677c6f8457943fe5b56d7a...

      Feel free to have it under your choice of OSI-approved license, including CC0.

      • pkhuong 3 days ago |
        Yeah, that's the same structure. For simple comparisons, I would expect a conditional move, so zero branches aside from the data-independent loop.
        • ryao 3 days ago |
          My recollection is that the branch predictor should mispredict the last branch of the loop. Unrolling eliminates that misprediction by replacing all of the loop iteration branches with a jump table branch that it will predict correctly more often. This is micro-optimizing to an extreme. I vaguely recall seeing a slight improvement in a microbenchmark that I attributed to this effect.
  • ryao 3 days ago |
    I implemented a variation of this in ZFS’ btree implementation a few years ago:

    https://github.com/openzfs/zfs/commit/677c6f8457943fe5b56d7a...

    It only works if the comparator is inlined, which is why I had to use a macro and duplicate the code everywhere. C++’s templates effectively do the same thing.
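
    Schematically, the macro approach looks something like this (a hypothetical sketch, not the actual ZFS macro): one search function is stamped out per element type and comparator, so the comparison inlines into the loop, much like a C++ template instantiation.

        #include <stddef.h>

        /* Hypothetical sketch (not the ZFS macro): COMP is expected to
         * return -1, 0 or 1 and gets inlined into the generated loop. */
        #define DEFINE_BTREE_FIND(NAME, TYPE, COMP)                      \
        static TYPE *                                                    \
        NAME(TYPE *first, size_t nelems, const TYPE *value)              \
        {                                                                \
            while (nelems > 1) {                                         \
                size_t half = nelems / 2;                                \
                nelems -= half;                                          \
                first += (COMP(&first[half - 1], value) < 0) * half;     \
            }                                                            \
            return first;                                                \
        }

    Each invocation such as DEFINE_BTREE_FIND(find_u64, uint64_t, cmp_u64) (illustrative names) then stamps out a dedicated search function with the comparison inlined.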

    By the way, I notice that C++ favors comparators returning a boolean while C favors comparators returning -1, 0 or 1. Let me share this simple macro from ZFS:

        #define TREE_CMP(a, b) (((a) > (b)) - ((a) < (b)))

    Use that to generate -1, 0 or 1. If you want a boolean, test whether the output is < 0 to get the boolean a C++ comparator would return; as long as the comparator is inlined, the compiler will automatically simplify it, so you still get only one comparison.
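
    For example (a small sketch, not code from ZFS), deriving a C++-style boolean comparator from the three-way macro:

        #define TREE_CMP(a, b) (((a) > (b)) - ((a) < (b)))  /* as above */

        /* three-way comparator: -1, 0 or 1 */
        static inline int cmp_int(const int *a, const int *b) {
            return TREE_CMP(*a, *b);
        }

        /* boolean "less than", i.e. what a C++ comparator returns; once
         * cmp_int() is inlined, ((a > b) - (a < b)) < 0 simplifies back
         * down to a single a < b comparison */
        static inline bool less_int(const int *a, const int *b) {
            return cmp_int(a, b) < 0;
        }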

    The C++ STL designers who opted for a boolean return value from comparators engaged in premature optimization, since they had no reason to break with the C convention given C++’s "the compiler will inline and optimize" philosophy. The neat thing about the C way is that if you want to generate a boolean from the comparator, you get to choose the boolean’s semantics (< 0 or <= 0), while C++ made that choice for you.

    That said, I have never needed the latter and as a much younger developer, I thought that the C++ way was superior, but time has caused me to conclude the C guys got this right.

    Finally, since I am on the topic of comparators, there is a fairly famous example of how to generate -1, 0 or 1 when comparing against 0:

    https://www.cs.cornell.edu/courses/cs6120/2020fa/blog/super-...

    I only mention it since it is interesting and I like sharing interesting things. Reimplementing the signum function from there with TREE_CMP() does not generate code as good as the superoptimized example, as you can see on godbolt:

    https://godbolt.org/z/1j8nWjnqP

    It is probably an optimization opportunity for GCC.
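
    For concreteness, the TREE_CMP() version of signum I mean is just this (a minimal sketch):

        /* signum via the macro above: expands to (x > 0) - (x < 0) */
        static inline int sign(int x) {
            return TREE_CMP(x, 0);
        }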

  • BoingBoomTschak 3 days ago |
    Related repository of interest: https://github.com/scandum/binary_search

    "The most notable variant, the monobound binary search, executes two to four times faster than the standard binary search on arrays smaller than 1 million 32 bit integers."