Regardless of the cache associativity, the leading dimension must correspond to a size in bytes that is a multiple of the cache line size (64 bytes on most current CPUs), and the matrix itself must be aligned to a cache-line boundary.
Once that condition is fulfilled, the total size of a row or of a column (depending on whether the matrix uses C order or Fortran order) may or may not perform best as a power of two; this is best determined experimentally on the target CPUs or GPUs.
Matching the DRAM characteristics influences the performance of a linear algebra algorithm far less than matching the cache line size and the sizes of the L1 and L2 caches.
If you really want to tune an algorithm for the maximum performance achievable on a particular computer, you must use an automatic tuning method that benchmarks the algorithm while varying all matrix layout parameters until it finds the optimal values. The ATLAS library, which implements the BLAS API, is an example of this approach.
A library tuned this way for a given computer should then be used only on that computer.