• stoniejohnson a day ago |
    Great post! I appreciate the diagrams.
  • amelius a day ago |
    TL;DR: make sure your matrix dimensions are divisible by a high power of 2.
    • ykonstant a day ago |
      I have always done that instinctively, even when there is no formal demand by the system. Every time I have implemented matmuls, at any level, with any optimization requirement, partitioning into dyadic blocks has always sped things up. So I try to feed the system the nicest numbers I can muster!
      • zusammen 15 hours ago |
        D&C approaches are applicable to lots of problems and, as you’ve noticed, tend to do well with “round” (in binary) numbers.
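        A minimal sketch of that dyadic divide-and-conquer blocking (plain Python; square power-of-two sizes are assumed, and real implementations block for cache rather than recursing all the way down):

```python
# Divide-and-conquer matmul on dyadic (power-of-two) sizes.
# With "round" dimensions every split is exact -- no ragged edge cases.
def matmul_dc(A, B, threshold=2):
    n = len(A)
    if n <= threshold:
        # Base case: naive triple loop on a small block.
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    h = n // 2
    def quad(M):
        # Split a matrix into its four n/2 x n/2 quadrants.
        return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
                [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    C11 = add(matmul_dc(A11, B11), matmul_dc(A12, B21))
    C12 = add(matmul_dc(A11, B12), matmul_dc(A12, B22))
    C21 = add(matmul_dc(A21, B11), matmul_dc(A22, B21))
    C22 = add(matmul_dc(A21, B12), matmul_dc(A22, B22))
    # Stitch the quadrants back together row by row.
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + \
           [r1 + r2 for r1, r2 in zip(C21, C22)]
```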
    • carlmr 19 hours ago |
      And if you can't make all of them nicely divisible by 2, at least make sure the inner dimensions of your matmuls are (if possible).
    • chillee 13 hours ago |
      Well, that'll help with a lot :) But dealing with wave quantization requires dimensions that aren't necessarily a multiple of 2, and often are a multiple of the number of SMs on a GPU (e.g. 132 on an H100)
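      The wave-quantization arithmetic can be seen with a rough back-of-the-envelope calculation (the 128x128 output tile size is an assumed illustration; 132 SMs as on an H100):

```python
import math

# Each output tile is scheduled onto one SM; tiles run in "waves" of
# sms-at-a-time. A small tail wave leaves most SMs idle.
def waves(M, N, sms=132, tile=128):
    tiles = math.ceil(M / tile) * math.ceil(N / tile)
    full_waves, tail = divmod(tiles, sms)
    return tiles, full_waves, tail

# 1024x1024 output -> 64 tiles: less than one wave, 68 of 132 SMs idle.
# 2048x2048 output -> 256 tiles: one full wave plus a 124-tile tail wave.
```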
  • jabl 20 hours ago |
    I recall some optimization advice to choose a leading dimension that is NOT a power of two, in order to avoid cache set associativity conflicts. Guess it's highly hardware dependent (in particular, that advice was for CPUs, not GPUs).
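    The conflict that advice guards against can be sketched with a few lines of arithmetic (the cache geometry below is illustrative: 64 sets of 64-byte lines, i.e. a 32 KiB 8-way L1):

```python
# Map a byte address to its cache set index.
def cache_set(addr, line=64, sets=64):
    return (addr // line) % sets

# Walking down one column touches one address per row. With a row stride
# that is a multiple of sets * line bytes, every row maps to the SAME set:
stride_pow2 = 4096  # 1024 float32s per row
sets_hit = {cache_set(r * stride_pow2) for r in range(16)}
# -> one set, so only 8 ways are usable and conflict misses pile up.

# Padding the leading dimension by one cache line spreads the rows out:
stride_padded = 4096 + 64
sets_hit_padded = {cache_set(r * stride_padded) for r in range(16)}
# -> 16 distinct sets.
```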
    • adrian_b 13 hours ago |
      Many Intel CPUs have caches whose number of ways is not a power of two but contains a factor of 3 or 5, corresponding to cache sizes like 2.5 MB, 3 MB, 24 MB or 36 MB.

      Regardless of the cache associativity, the leading dimension must always correspond to a size that is a multiple of the cache line size (64 bytes for most CPUs) and the matrix must be aligned to the cache line size.

      With that condition fulfilled, the total size of a row or of a column (depending on whether the matrix uses C order or Fortran order) may be optimal as a power of two or not, which should be better tested experimentally on the target CPUs or GPUs.
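      A small helper along those lines (assuming 64-byte cache lines, as above) might round the leading dimension up like this:

```python
# Round a leading dimension up so every row starts on a cache-line
# boundary (assumes the matrix itself is allocated line-aligned).
def padded_ld(n_cols, elem_size, line_bytes=64):
    elems_per_line = line_bytes // elem_size
    # Ceiling division, then back to element units.
    return -(-n_cols // elems_per_line) * elems_per_line

# e.g. a 1000-column float32 matrix gets a leading dimension of 1008,
# so each row spans whole cache lines.
```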

      • muziq 9 hours ago |
        DRAM page size, surely? All those cheap CASes?
        • adrian_b an hour ago |
          Unlike the cache line size, which is known, you cannot predict how the DRAM address bits are mapped to CPU or GPU addresses, or what the page size of the DIMMs that happen to be used will be.

          Matching the DRAM characteristics influences the performance of a linear algebra algorithm much less than matching the cache line size and the sizes of the L1 and L2 cache memories.

          If you really want to tune an algorithm for the maximum performance that can be achieved on a particular computer, you must use an automatic tuning method, which benchmarks that algorithm while varying all matrix layout parameters until it finds optimal values. An example of how this is done is provided by the ATLAS library, which implements the BLAS API.

          Such a library tuned for a given computer should then be used only on that computer.
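          A toy version of such an auto-tuning loop (the kernel and parameter grid here are placeholders, not ATLAS's actual search space) might look like:

```python
import itertools
import time

def timeit_once(kernel, cfg):
    # One timed run of the kernel with a given parameter configuration.
    start = time.perf_counter()
    kernel(**cfg)
    return time.perf_counter() - start

# Benchmark the same kernel over a grid of layout parameters and keep
# the fastest configuration (minimum of a few repeats to reduce noise).
def autotune(kernel, grid, repeats=3):
    best_cfg, best_t = None, float("inf")
    for params in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), params))
        t = min(timeit_once(kernel, cfg) for _ in range(repeats))
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t
```

The tuned configuration is then baked in for that one machine, which is exactly why such a build shouldn't be copied to different hardware.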

    • dekhn 11 hours ago |
      That only makes sense if you know for sure your application is running on a specific architecture. Otherwise, it's a highly specific optimization that is bound to violate another architecture's design, and it would also be extremely challenging to debug.