MatmulKernel(3)

I have learned a lot about lower-level programming! I may return to a couple of older posts to elaborate or add corrections.
But for now, let’s explore something new.

As a challenge, I wanted to touch on matrix multiplication written in Julia.
There are two chief advantages of this:

  1. The ability to get both generic and optimal code.
  2. The ability to get optimal performance at small and hopefully large sizes.

Libraries like Intel MKL and OpenBLAS (the default in Julia) are slow at small sizes. While MKL is proprietary, we can look at the source code of OpenBLAS and see that it has assembly kernels written for many different architectures.
Each time a new architecture comes out, before it’s supported, new kernels must be added, or old ones adapted! That is hard to maintain.

If we could rely on compilers, which also get updated for each new architecture, it could make our lives a lot easier.
Of course, compilers can’t do everything, so we can use CpuId.jl to provide information about the CPU we’re running on.

I will be assuming you’re using an x86_64 processor (almost all laptop and desktop processors are), but it would be good to support ARM or POWER too. I’m not familiar with these, however.

In this post, I will focus on optimizing a “compute kernel”. The compute kernel is a small core calculation, an atom that we build the larger operations out of.
Say we have $\textbf{D} = \textbf{A} \textbf{X}$, where $\textbf{A}$ is $M\times N$, $\textbf{X}$ is $N\times P$, and $\textbf{D}$ is $M\times P$. We can divide these into blocks such that:

$\begin{bmatrix}\textbf{D}_{1,1} & \textbf{D}_{1,2}\\\textbf{D}_{2,1} & \textbf{D}_{2,2}\end{bmatrix} = \begin{bmatrix}\textbf{A}_{1,1} & \textbf{A}_{1,2}\\\textbf{A}_{2,1} & \textbf{A}_{2,2}\end{bmatrix} \begin{bmatrix}\textbf{X}_{1,1} & \textbf{X}_{1,2}\\\textbf{X}_{2,1} & \textbf{X}_{2,2}\end{bmatrix}\\ = \begin{bmatrix} \textbf{A}_{1,1}\textbf{X}_{1,1} + \textbf{A}_{1,2}\textbf{X}_{2,1}& \textbf{A}_{1,1}\textbf{X}_{1,2} + \textbf{A}_{1,2}\textbf{X}_{2,2}\\ \textbf{A}_{2,1}\textbf{X}_{1,1} + \textbf{A}_{2,2}\textbf{X}_{2,1}& \textbf{A}_{2,1}\textbf{X}_{1,2} + \textbf{A}_{2,2}\textbf{X}_{2,2}\end{bmatrix}$

so in this way, we can break a larger computation down into blocks matching our kernel(s).
Combining these kernels into larger operations will be the subject of another post — dealing with memory bottlenecks for large matrices is a challenge.
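As a quick numerical illustration of that identity (a minimal sketch, with dimensions of my own choosing):

using LinearAlgebra

# Check the block identity above with M = N = P = 4 and 2×2 blocks.
A = randn(4, 4); X = randn(4, 4)
D = A * X

A11, A12, A21, A22 = A[1:2, 1:2], A[1:2, 3:4], A[3:4, 1:2], A[3:4, 3:4]
X11, X12, X21, X22 = X[1:2, 1:2], X[1:2, 3:4], X[3:4, 1:2], X[3:4, 3:4]

D11 = A11*X11 + A12*X21 # top-left block of D
D12 = A11*X12 + A12*X22
D21 = A21*X11 + A22*X21
D22 = A21*X12 + A22*X22

[D11 D12; D21 D22] ≈ D # true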

For now, let us focus just on the kernel itself. That is also the part most relevant to fast operations on small matrices.
To begin, let’s consider a single core of an x86_64 processor:

  1. It has 16 (on most chips) or 32 (with AVX-512) vector registers.
  2. Each register holds 16 (old), 32 (most), or 64 (AVX-512) bytes.
  3. One instruction can load a contiguous chunk of data into a register, or broadcast a single value across it. Gather operations, which load bits of data that aren’t side by side, are comparatively very slow.
  4. One instruction can perform a multiply, an add, or a fused multiply-add (FMA) on the entire contents of a register.

Throughput for operations is complicated, but as a rule of thumb: the wider the vectors, the better. Commonly cited caveats (which GCC follows by default, requiring -mprefer-vector-width=NUMBER_OF_BITS to override) are:

  1. Ryzen’s throughput is equal for 32-byte and 16-byte FMA instructions, because its cores have 16-byte units that work together to act like a 32-byte unit. Intel processors have full-sized units.
  2. Intel AVX-512 processors run at a slower clock rate when executing a lot of AVX-512 operations, because these operations generate a lot of heat.

However, in both cases, the larger vectors are still dramatically faster in numerical code that is easy to vectorize. Here, easy means “doesn’t need much code to arrange and line up data in the correct way”. Most linear algebra operations do not. Matrix multiplication in particular benefits from using far fewer memory operations the wider the vectors are (and the more registers the CPU has).

To check some of these features of your CPU, you can use CpuId.jl as follows:

In [1]:
using CpuId
REGISTER_SIZE = simdbytes()
REGISTER_SIZE, cpufeature(CpuId.AVX512F), cpufeature(CpuId.FP256)
Out[1]:
(32, false, false)

This means that on the computer I am currently writing with, I have 32-byte (256-bit) SIMD instructions, but I do not have the AVX-512 foundation, and the 256-bit floating point instructions are not native (although they can still be used). That is because I am using a Ryzen processor:

In [2]:
versioninfo()
Julia Version 0.7.0-rc3.3
Commit 0600872 (2018-08-07 14:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, znver1)
In [3]:
simdbytes() ÷ sizeof(Float64) # How many Float64 fit in a single register
Out[3]:
4

simdbytes gives the SIMD (single instruction, multiple data) vector width in bytes, while cpufeature(CpuId.AVX512F) checks whether we have the AVX-512 foundation extension, which doubles the number of registers from 16 to 32 (alongside other benefits). We don’t.

Now, moving onto how we can organize our operations, recall that the element $D_{m,p} = \sum_{n=1}^N A_{m,n} * X_{n,p}$.
That is, for an element of the answer, we move across the columns of $\textbf{A}$ and down the rows of $\textbf{X}$.
I will now be assuming a column major order, the standard in languages like Julia, R, MATLAB, and Fortran. Languages like C, C++, Rust, and Python are instead row major, but the same strategy would apply for $\textbf{D}^* = \textbf{X}^*\textbf{A}^*$, where $\textbf{D}^*$ is $P\times M$, $\textbf{X}^*$ is $P\times N$, and $\textbf{A}^*$ is $N\times M$.
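If you are coming from a row-major language, a quick check of that transposed formulation (a sketch of my own, not tied to the kernels below):

A = randn(8, 16); X = randn(16, 6)
# A row-major D is a column-major Dᵀ = Xᵀ Aᵀ, so the same column-major
# strategy applies with the operands swapped and transposed.
(X' * A')' ≈ A * X # true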

Given the column major order, the elements in a column are contiguous, and can be efficiently loaded into or stored from a vector register together, while the elements in a row are not.
Our objective is to minimize how much we’re doing anything other than arithmetic operations. Part of that means loading and storing contents from registers as infrequently as possible, and every time we do load data from $\textbf{A}$ or $\textbf{X}$, we want to do as much as we can with it before loading something else in its place.

Let’s define the shape of a kernel as an $m\times p$ block, corresponding for example to $\textbf{A}_{1,1}\textbf{X}_{1,1}$ above $-$ a sub-block we’re calculating on our way to getting the full answer.
Because the columns are contiguous, we want these columns to be made up of vectors stored in the registers. Naturally, $m$ should be some multiple of the size of our registers for the given data type. For reasons discussed later, it should also be a multiple of the cache line size, given by cachelinesize() in CpuId ($64$ bytes on this computer), hence we want $m$ to be a multiple of 8.
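To check those numbers on your own machine, the same CpuId.jl queries as before apply:

using CpuId

cachelinesize()                   # 64 bytes on this machine
cachelinesize() ÷ sizeof(Float64) # 8, so m should be a multiple of 8 for Float64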

Doing a little algebra, we see that:

$
D_{1,1} = A_{1,1}X_{1,1} + A_{1,2}X_{2,1} + A_{1,3}X_{3,1} +\dots\\
D_{2,1} = A_{2,1}X_{1,1} + A_{2,2}X_{2,1} + A_{2,3}X_{3,1} +\dots\\
D_{3,1} = A_{3,1}X_{1,1} + A_{3,2}X_{2,1} + A_{3,3}X_{3,1} +\dots\\
D_{4,1} = A_{4,1}X_{1,1} + A_{4,2}X_{2,1} + A_{4,3}X_{3,1} +\dots\\
D_{5,1} = A_{5,1}X_{1,1} + A_{5,2}X_{2,1} + A_{5,3}X_{3,1} +\dots\\
D_{6,1} = A_{6,1}X_{1,1} + A_{6,2}X_{2,1} + A_{6,3}X_{3,1} +\dots\\
D_{7,1} = A_{7,1}X_{1,1} + A_{7,2}X_{2,1} + A_{7,3}X_{3,1} +\dots\\
D_{8,1} = A_{8,1}X_{1,1} + A_{8,2}X_{2,1} + A_{8,3}X_{3,1} +\dots\\
$

Therefore, if we want to calculate the first column of $\textbf{D}$, we need to load vectors from the columns of $\textbf{A}$ and broadcast elements from $\textbf{X}$. Each broadcast from $\textbf{X}$ is reused for every vector loaded from the corresponding column of $\textbf{A}$; that is, each broadcast yields $\lceil\frac{m}{\text{vector length}}\rceil$ fused multiply-adds.
Similarly, the second column of $\textbf{D}$ looks like:

$
D_{1,2} = A_{1,1}X_{1,2} + A_{1,2}X_{2,2} + A_{1,3}X_{3,2} +\dots\\
D_{2,2} = A_{2,1}X_{1,2} + A_{2,2}X_{2,2} + A_{2,3}X_{3,2} +\dots\\
D_{3,2} = A_{3,1}X_{1,2} + A_{3,2}X_{2,2} + A_{3,3}X_{3,2} +\dots\\
D_{4,2} = A_{4,1}X_{1,2} + A_{4,2}X_{2,2} + A_{4,3}X_{3,2} +\dots\\
D_{5,2} = A_{5,1}X_{1,2} + A_{5,2}X_{2,2} + A_{5,3}X_{3,2} +\dots\\
D_{6,2} = A_{6,1}X_{1,2} + A_{6,2}X_{2,2} + A_{6,3}X_{3,2} +\dots\\
D_{7,2} = A_{7,1}X_{1,2} + A_{7,2}X_{2,2} + A_{7,3}X_{3,2} +\dots\\
D_{8,2} = A_{8,1}X_{1,2} + A_{8,2}X_{2,2} + A_{8,3}X_{3,2} +\dots\\
$

Now, we see that we can reuse loads from $\textbf{A}$ across columns of $\textbf{D}$.
For each vector loaded from $\textbf{A}$, we get one operation per column of the kernel, $p$.

This gives us a constrained optimization problem. We are constrained by the number of CPU registers, while trying to maximize the number of operations per load.
Choosing $m=8,p=6$: for every 2 loads from $\textbf{A}$ and 6 broadcasts from $\textbf{X}$, we get 12 multiplications, or 1.5 multiplications per memory operation.
This also requires $2*6=12$ registers to hold elements of $\textbf{D}$. As we move from $n=1$ to $n=N$, we also need 2 registers for $\textbf{A}$, and ideally a couple for broadcasts from $\textbf{X}$. That totals exactly 16 registers.

Choosing $m=16,p=2$ gives us 4 loads and 2 broadcasts per 8 arithmetic operations, or $\frac{8}{6}\approx 1.33$ operations per memory operation. This takes up $4*2+4+2 = 14$ registers.
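Here is a small helper of my own (not from any package) that redoes this register bookkeeping for an arbitrary candidate shape, assuming 4 doubles per register as on this machine:

# Registers used and arithmetic per memory operation for an m×p kernel,
# with `vlen` elements per vector register (4 Float64 under AVX2).
function kernel_cost(m, p; vlen = 4)
    a_loads = cld(m, vlen)           # vector loads from a column of A per n
    d_regs = a_loads * p             # accumulators held for the block of D
    registers = d_regs + a_loads + 2 # plus the A vectors and a couple of X broadcasts
    fmas = a_loads * p               # fused multiply-adds per n
    (registers = registers, ops_per_mem = fmas / (a_loads + p)) # p broadcasts from X per n
end

kernel_cost(8, 6)  # (registers = 16, ops_per_mem = 1.5)
kernel_cost(16, 2) # (registers = 14, ops_per_mem ≈ 1.33)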

To implement a kernel, we could rely on LLVM to autovectorize.
In my experience, it does a good job of using vector instructions correctly, but it tends to emit extra vmovapd instructions that move vectors from one register to another unnecessarily. It does not seem to do this when we’re being explicit, and because we know exactly what we want to happen, we may as well not rely on the autovectorizer. SIMD.jl allows us to use LLVM vector intrinsics.

For now, we’ll also simply implement them for StaticArrays.jl’s MMatrix type, a mutable matrix of static size (known at compile time). Unfortunately, SIMD.jl only supports loading from the basic Array type and pointers (or broadcasting scalars), forcing us to use pointers here.

In [4]:
using SIMD, StaticArrays, LinearAlgebra

function kernel!(D::MMatrix{8,6,Float64},
                A::MMatrix{8,N,Float64},
                X::MMatrix{N,6,Float64}) where N
    pD, pA = pointer(D), pointer(A)
    V = Vec{8,Float64}

    Dx_1 = vload(V, pD + 64*0) # 8 bytes/element * 8 elements/column
    Dx_2 = vload(V, pD + 64*1)
    Dx_3 = vload(V, pD + 64*2)
    Dx_4 = vload(V, pD + 64*3)
    Dx_5 = vload(V, pD + 64*4)
    Dx_6 = vload(V, pD + 64*5)
    @inbounds for n ∈ 1:N # Here, we increase N.
        vA = vload(V, pA + (n-1)*64)
        Dx_1 = fma(vA, V(X[n,1]), Dx_1)
        Dx_2 = fma(vA, V(X[n,2]), Dx_2)
        Dx_3 = fma(vA, V(X[n,3]), Dx_3)
        Dx_4 = fma(vA, V(X[n,4]), Dx_4)
        Dx_5 = fma(vA, V(X[n,5]), Dx_5)
        Dx_6 = fma(vA, V(X[n,6]), Dx_6)
    end
    vstore(Dx_1, pD + 64*0)
    vstore(Dx_2, pD + 64*1)
    vstore(Dx_3, pD + 64*2)
    vstore(Dx_4, pD + 64*3)
    vstore(Dx_5, pD + 64*4)
    vstore(Dx_6, pD + 64*5)
    D
end

const D8_6  = @MMatrix zeros(8,6)
const A8_16 = @MMatrix randn(8,16)
const X16_6 = @MMatrix randn(16,6)

kernel!(D8_6, A8_16, X16_6)
Out[4]:
8×6 MArray{Tuple{8,6},Float64,2,48}:
 -0.120053   0.129256   11.0293    -3.00598   -1.40966    -4.69054
 -4.94158   -0.0207372   5.82366    2.52831   -0.921697   -2.68802
 -8.56774    4.60838    12.1258    -2.07117    0.0488104  -6.53091
  1.0102     0.306962    0.510806   0.716743   6.0147     -0.799797
 -4.57225    3.54648     8.08041   -0.801935   0.961744   -2.82873
 -0.627791  -2.17039    -8.23259    1.50961    0.502956    2.56525
  2.05503    0.0285736  -4.36238   -0.333866  -3.13564    -0.382945
 -2.57792    2.76625    -4.77654    5.44652   -1.38794    -5.38515 
In [5]:
mul!(D8_6, A8_16, X16_6) # Compare answer with StaticArrays.jl's mul.
Out[5]:
8×6 MArray{Tuple{8,6},Float64,2,48}:
 -0.120053   0.129256   11.0293    -3.00598   -1.40966    -4.69054
 -4.94158   -0.0207372   5.82366    2.52831   -0.921697   -2.68802
 -8.56774    4.60838    12.1258    -2.07117    0.0488104  -6.53091
  1.0102     0.306962    0.510806   0.716743   6.0147     -0.799797
 -4.57225    3.54648     8.08041   -0.801935   0.961744   -2.82873
 -0.627791  -2.17039    -8.23259    1.50961    0.502956    2.56525
  2.05503    0.0285736  -4.36238   -0.333866  -3.13564    -0.382945
 -2.57792    2.76625    -4.77654    5.44652   -1.38794    -5.38515 

Note that the LinearAlgebra and StaticArrays mul! function calculates $\textbf{D} = \textbf{AX}$, while our kernel calculates $\textbf{D} = \textbf{D} + \textbf{AX}$, because the blockwise summation looks mostly like the latter.
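As a quick sanity check of that difference, reusing the matrices defined above, zeroing D first makes the two agree:

D8_6 .= 0                                         # start the accumulators at zero
kernel!(D8_6, A8_16, X16_6) ≈ A8_16 * X16_6       # true, since D was zero
kernel!(D8_6, A8_16, X16_6) ≈ 2 * (A8_16 * X16_6) # true: a second call keeps accumulating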
Nonetheless, we can time it compared to the StaticArrays function:

In [6]:
using BenchmarkTools
@benchmark kernel!(D8_6, A8_16, X16_6)
Out[6]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     54.977 ns (0.00% GC)
  median time:      55.903 ns (0.00% GC)
  mean time:        57.104 ns (0.00% GC)
  maximum time:     86.223 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     985
In [7]:
@benchmark mul!(D8_6, A8_16, X16_6)
Out[7]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     110.435 ns (0.00% GC)
  median time:      113.655 ns (0.00% GC)
  mean time:        115.936 ns (0.00% GC)
  maximum time:     231.914 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     927

and see quite an improvement.
Let us also briefly take an aside to consider C and Fortran implementations.
This C code is comparable to the Julia code, relying on vector intrinsics:

#include<immintrin.h>

void kernel8x16x6(double* D, double* A, double* X, long N){
    __m256d D11, D12, D21, D22, D31, D32, D41, D42, D51, D52, D61, D62;
    __m256d A1, A2, X1, X2, X3, X4, X5, X6;
    D11 = _mm256_loadu_pd(D);
    D12 = _mm256_loadu_pd(D + 4);
    D21 = _mm256_loadu_pd(D + 8);
    D22 = _mm256_loadu_pd(D + 12);
    D31 = _mm256_loadu_pd(D + 16);
    D32 = _mm256_loadu_pd(D + 20);
    D41 = _mm256_loadu_pd(D + 24);
    D42 = _mm256_loadu_pd(D + 28);
    D51 = _mm256_loadu_pd(D + 32);
    D52 = _mm256_loadu_pd(D + 36);
    D61 = _mm256_loadu_pd(D + 40);
    D62 = _mm256_loadu_pd(D + 44);
    for(long i = 0; i < N; i++){
        A1 = _mm256_loadu_pd(A + 8*i);
        A2 = _mm256_loadu_pd(A + 8*i + 4);
        X1 = _mm256_broadcast_sd(X+i);
        D11 = _mm256_fmadd_pd(A1, X1, D11);
        D12 = _mm256_fmadd_pd(A2, X1, D12);
        X2 = _mm256_broadcast_sd(X+(N+i));
        D21 = _mm256_fmadd_pd(A1, X2, D21);
        D22 = _mm256_fmadd_pd(A2, X2, D22);
        X3 = _mm256_broadcast_sd(X+(2*N+i));
        D31 = _mm256_fmadd_pd(A1, X3, D31);
        D32 = _mm256_fmadd_pd(A2, X3, D32);
        X4 = _mm256_broadcast_sd(X+(3*N+i));
        D41 = _mm256_fmadd_pd(A1, X4, D41);
        D42 = _mm256_fmadd_pd(A2, X4, D42);
        X5 = _mm256_broadcast_sd(X+(4*N+i));
        D51 = _mm256_fmadd_pd(A1, X5, D51);
        D52 = _mm256_fmadd_pd(A2, X5, D52);
        X6 = _mm256_broadcast_sd(X+(5*N+i));
        D61 = _mm256_fmadd_pd(A1, X6, D61);
        D62 = _mm256_fmadd_pd(A2, X6, D62);
    }
    _mm256_storeu_pd(D, D11);
    _mm256_storeu_pd(D + 4, D12);
    _mm256_storeu_pd(D + 8, D21);
    _mm256_storeu_pd(D + 12, D22);
    _mm256_storeu_pd(D + 16, D31);
    _mm256_storeu_pd(D + 20, D32);
    _mm256_storeu_pd(D + 24, D41);
    _mm256_storeu_pd(D + 28, D42);
    _mm256_storeu_pd(D + 32, D51);
    _mm256_storeu_pd(D + 36, D52);
    _mm256_storeu_pd(D + 40, D61);
    _mm256_storeu_pd(D + 44, D62);
}

A couple comments:

  1. I’d read that larger vectors are not automatically split into multiple instances of smaller vectors in an efficient way, unlike the LLVM intrinsics in Julia. Therefore, we need to split each column into two chunks. I have not yet verified whether this is still true.
  2. We need to remember substantially more. Defining the vector type we want in Julia is straightforward, e.g. Vec{4,Float64} specifies a vector of 4 doubles, and we can then use this type to dispatch to the correct functions (see the short sketch after this list).
  3. I used “long” for the integer values, because Int in Julia defaults to 64 bit integers.
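To make point 2 concrete, here is a tiny sketch (with values of my own choosing) of the SIMD.jl counterparts of the intrinsics used above:

using SIMD

V = Vec{4,Float64}          # one AVX2 register's worth of Float64
a = V((1.0, 2.0, 3.0, 4.0)) # like _mm256_loadu_pd from a small buffer
x = V(2.0)                  # broadcast a scalar, like _mm256_broadcast_sd
d = V(0.0)
d = fma(a, x, d)            # one vfmadd, like _mm256_fmadd_pd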

For Fortran I didn’t use vector intrinsics.
The do loop (i.e., the for loop of other languages) failed to optimize correctly with the compilers and flags I tried (gfortran, flang, ifort). It either stored and loaded from registers unnecessarily, or tried to vectorize across loop iterations instead of within them.
I filed an issue on gcc’s bugzilla here (despite the bug’s name, I’d prefer not using -funroll-loops):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

Manually unrolling the loop probably isn’t really worth it. I’ll spare you the full code (you can download it from the above link), but it looks like this:

subroutine mulkernel(D, A, X)
        real(8), dimension(8,16),   intent(in)  :: A
        real(8), dimension(16,6),   intent(in)  :: X
        real(8), dimension(8,6),    intent(out) :: D

        D(1:8,1) = D(1:8,1) + A(1:8,1) * X(1,1)
        D(1:8,2) = D(1:8,2) + A(1:8,1) * X(1,2)
        D(1:8,3) = D(1:8,3) + A(1:8,1) * X(1,3)
        D(1:8,4) = D(1:8,4) + A(1:8,1) * X(1,4)
        D(1:8,5) = D(1:8,5) + A(1:8,1) * X(1,5)
        D(1:8,6) = D(1:8,6) + A(1:8,1) * X(1,6)

        D(1:8,1) = D(1:8,1) + A(1:8,2) * X(2,1)
        D(1:8,2) = D(1:8,2) + A(1:8,2) * X(2,2)
        D(1:8,3) = D(1:8,3) + A(1:8,2) * X(2,3)
        D(1:8,4) = D(1:8,4) + A(1:8,2) * X(2,4)
        D(1:8,5) = D(1:8,5) + A(1:8,2) * X(2,5)
        D(1:8,6) = D(1:8,6) + A(1:8,2) * X(2,6)

simply repeating these blocks 16 times, for n = 1:16. That means that, unlike the Julia and C code, we awkwardly need N=16. Unrolling bloats code size here without real benefit, because each loop iteration already contains 12 fma instructions.
I compiled with (gcc version 8.1):
gcc -O2 -march=haswell -shared -fPIC loopkernel.c -o libcloopkernel.so
gfortran -Ofast -march=haswell -shared -fPIC fortunrolled.f90 -o libfunrolledkernel.so
The architecture is znver1, but choosing Haswell often results in much faster code — in particular, it does here. -mprefer-vector-width=256 gets you some of the way, but not all — it still ended up using a bunch of silly memory operations that made the Fortran code run 2x slower. So haswell it is.

In [8]:
fpath = "/home/chris/Documents/progwork/julia/matmul"
const mullib = joinpath(fpath, "libfunrolledkernel.so")
const cmullib = joinpath(fpath, "libcloopkernel.so")
function cmul!(D::MMatrix{8,6,Float64},A::MMatrix{8,N,Float64},X::MMatrix{N,6,Float64}) where N
    ccall((:kernel8x16x6, cmullib), Cvoid, (Ptr{Float64},Ptr{Float64},Ptr{Float64}, Clong), D, A, X, N)
    D
end
function fmul!(D::MMatrix{8,6,Float64},A::MMatrix{8,16,Float64},X::MMatrix{16,6,Float64})
    ccall((:__kernels_MOD_mulkernel, mullib), Cvoid, (Ptr{Float64},Ptr{Float64},Ptr{Float64}), D, A, X)
    D
end
D8_6 .= 0; cmul!(D8_6, A8_16, X16_6)
Out[8]:
8×6 MArray{Tuple{8,6},Float64,2,48}:
 -0.120053   0.129256   11.0293    -3.00598   -1.40966    -4.69054
 -4.94158   -0.0207372   5.82366    2.52831   -0.921697   -2.68802
 -8.56774    4.60838    12.1258    -2.07117    0.0488104  -6.53091
  1.0102     0.306962    0.510806   0.716743   6.0147     -0.799797
 -4.57225    3.54648     8.08041   -0.801935   0.961744   -2.82873
 -0.627791  -2.17039    -8.23259    1.50961    0.502956    2.56525
  2.05503    0.0285736  -4.36238   -0.333866  -3.13564    -0.382945
 -2.57792    2.76625    -4.77654    5.44652   -1.38794    -5.38515 
In [9]:
@benchmark cmul!(D8_6, A8_16, X16_6)
Out[9]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     48.684 ns (0.00% GC)
  median time:      50.865 ns (0.00% GC)
  mean time:        52.616 ns (0.00% GC)
  maximum time:     89.734 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988
In [10]:
D8_6 .= 0; fmul!(D8_6, A8_16, X16_6)
Out[10]:
8×6 MArray{Tuple{8,6},Float64,2,48}:
 -0.120053   0.129256   11.0293    -3.00598   -1.40966    -4.69054
 -4.94158   -0.0207372   5.82366    2.52831   -0.921697   -2.68802
 -8.56774    4.60838    12.1258    -2.07117    0.0488104  -6.53091
  1.0102     0.306962    0.510806   0.716743   6.0147     -0.799797
 -4.57225    3.54648     8.08041   -0.801935   0.961744   -2.82873
 -0.627791  -2.17039    -8.23259    1.50961    0.502956    2.56525
  2.05503    0.0285736  -4.36238   -0.333866  -3.13564    -0.382945
 -2.57792    2.76625    -4.77654    5.44652   -1.38794    -5.38515 
In [11]:
@benchmark fmul!(D8_6, A8_16, X16_6)
Out[11]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     49.069 ns (0.00% GC)
  median time:      51.241 ns (0.00% GC)
  mean time:        52.605 ns (0.00% GC)
  maximum time:     81.409 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     987

Julia’s performance here was fairly close. I’ll use that as justification to focus on Julia for now.
Let us now consider the 16xNx2 kernel:

In [12]:
function kernel!(D::MMatrix{16,2,Float64},
                A::MMatrix{16,N,Float64},
                X::MMatrix{N,2,Float64}) where N
    pD, pA = pointer(D), pointer(A)
    V = Vec{16,Float64}

    vA = vload(V, pA)
    @inbounds begin
        Dx1 = vload(V, pD + 128*0)
        Dx2 = vload(V, pD + 128*1)
        for n ∈ 1:N # Here, we increase N.
            vA = vload(V, pA + (n-1)*128) # 8 bytes/element * 16 elements/column
            Dx1 = fma(vA, V(X[n,1]), Dx1)
            Dx2 = fma(vA, V(X[n,2]), Dx2)
        end
    end
    vstore(Dx1, pD + 128*0)
    vstore(Dx2, pD + 128*1)
    D
end
const D16_2 = @MMatrix zeros(16,2)
const A16_24 = @MMatrix randn(16,24)
const X24_2 = @MMatrix randn(24,2)
kernel!(D16_2, A16_24, X24_2)
Out[12]:
16×2 MArray{Tuple{16,2},Float64,2,32}:
  -1.07187   -6.5252
   0.220209  -5.51822
   2.52272   -0.746445
 -11.1554     7.06727
  -1.21545    9.74906
  -2.17784   -2.11581
  -2.30142    4.32875
   4.31754   -8.36784
   2.46978   -2.0183
   0.655119   0.787527
  11.7831    -4.91884
  -7.03475    1.36715
   4.51619   -5.03703
   0.128578   0.636429
   4.08147    1.41495
   1.16042   -1.32004 
In [13]:
mul!(D16_2, A16_24, X24_2)
Out[13]:
16×2 MArray{Tuple{16,2},Float64,2,32}:
  -1.07187   -6.5252
   0.220209  -5.51822
   2.52272   -0.746445
 -11.1554     7.06727
  -1.21545    9.74906
  -2.17784   -2.11581
  -2.30142    4.32875
   4.31754   -8.36784
   2.46978   -2.0183
   0.655119   0.787527
  11.7831    -4.91884
  -7.03475    1.36715
   4.51619   -5.03703
   0.128578   0.636429
   4.08147    1.41495
   1.16042   -1.32004 
In [14]:
@benchmark kernel!(D16_2, A16_24, X24_2)
Out[14]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     61.739 ns (0.00% GC)
  median time:      68.205 ns (0.00% GC)
  mean time:        69.153 ns (0.00% GC)
  maximum time:     106.078 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     980
In [15]:
@benchmark mul!(D16_2, A16_24, X24_2)
Out[15]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     124.063 ns (0.00% GC)
  median time:      127.344 ns (0.00% GC)
  mean time:        130.777 ns (0.00% GC)
  maximum time:     255.916 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     898

As expected, the $8\times6$ kernel seems to perform better.
Here, I tried to keep the amount of work constant, i.e. $8*16*6=16*24*2=768$. This is the number of multiplications needed: each of the $m\times p$ entries of the output matrix is the dot product of a length-$N$ row of $\textbf{A}$ with a length-$N$ column of $\textbf{X}$.
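For reference, the arithmetic throughput those timings imply (simple arithmetic on the minimum times above, nothing newly measured):

flops = 2 * 768     # 768 multiplies and 768 adds per kernel call
flops / 55e-9 / 1e9 # ≈ 28 GFLOPS for the 8×6 kernel at ~55 ns
flops / 62e-9 / 1e9 # ≈ 25 GFLOPS for the 16×2 kernel at ~62 ns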
We can look at the generated assembly:

In [16]:
@code_native kernel!(D8_6, A8_16, X16_6)
	.text
; Function kernel! {
; Location: In[4]:6
	movq	%rsi, -8(%rsp)
	movq	(%rsi), %rax
	movq	8(%rsi), %rcx
	movq	16(%rsi), %rdx
	movq	$-128, %rsi
; Location: In[4]:9
; Function vload; {
; Location: SIMD.jl:1236
; Function vload; {
; Location: SIMD.jl:1236
; Function macro expansion; {
; Location: SIMD.jl:1260
	vmovupd	(%rax), %ymm3
	vmovupd	32(%rax), %ymm4
;}}}
; Location: In[4]:10
; Function vload; {
; Location: SIMD.jl:1236
; Function vload; {
; Location: SIMD.jl:1236
; Function macro expansion; {
; Location: SIMD.jl:1260
	vmovupd	64(%rax), %ymm5
	vmovupd	96(%rax), %ymm6
;}}}
; Location: In[4]:11
; Function vload; {
; Location: SIMD.jl:1236
; Function vload; {
; Location: SIMD.jl:1236
; Function macro expansion; {
; Location: SIMD.jl:1260
	vmovupd	160(%rax), %ymm0
	vmovupd	128(%rax), %ymm7
;}}}
; Location: In[4]:12
; Function vload; {
; Location: SIMD.jl:1236
; Function vload; {
; Location: SIMD.jl:1236
; Function macro expansion; {
; Location: SIMD.jl:1260
	vmovupd	224(%rax), %ymm1
	vmovupd	192(%rax), %ymm8
;}}}
; Location: In[4]:13
; Function vload; {
; Location: SIMD.jl:1236
; Function vload; {
; Location: SIMD.jl:1236
; Function macro expansion; {
; Location: SIMD.jl:1260
	vmovupd	288(%rax), %ymm2
	vmovupd	256(%rax), %ymm9
;}}}
; Location: In[4]:14
; Function vload; {
; Location: SIMD.jl:1236
; Function vload; {
; Location: SIMD.jl:1236
; Function macro expansion; {
; Location: SIMD.jl:1260
	vmovupd	352(%rax), %ymm11
	vmovupd	320(%rax), %ymm10
	nopw	(%rax,%rax)
;}}}}
; Function kernel! {
; Location: SIMD.jl
L112:
	vmovapd	%ymm11, %ymm12
;}
; Function kernel! {
; Location: In[4]:16
; Function vload; {
; Location: SIMD.jl:1236
; Function vload; {
; Location: SIMD.jl:1236
; Function macro expansion; {
; Location: SIMD.jl:1260
	vmovupd	1024(%rcx,%rsi,8), %ymm13
;}}}
; Location: In[4]:17
; Function Type; {
; Location: SIMD.jl:114
; Function macro expansion; {
; Location: SIMD.jl:116
	vbroadcastsd	128(%rdx,%rsi), %ymm14
;}}
; Location: In[4]:16
; Function vload; {
; Location: SIMD.jl:1236
; Function vload; {
; Location: SIMD.jl:1236
; Function macro expansion; {
; Location: SIMD.jl:1260
	vmovupd	1056(%rcx,%rsi,8), %ymm11
;}}}
; Location: In[4]:22
; Function Type; {
; Location: SIMD.jl:114
; Function macro expansion; {
; Location: SIMD.jl:116
	vbroadcastsd	768(%rdx,%rsi), %ymm15
;}}
; Location: In[4]:17
; Function fma; {
; Location: SIMD.jl:1010
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function macro expansion; {
; Location: SIMD.jl:688
	vfmadd231pd	%ymm14, %ymm11, %ymm4
	vfmadd231pd	%ymm14, %ymm13, %ymm3
;}}}}
; Location: In[4]:18
; Function Type; {
; Location: SIMD.jl:114
; Function macro expansion; {
; Location: SIMD.jl:116
	vbroadcastsd	256(%rdx,%rsi), %ymm14
;}}
; Location: In[4]:22
; Function fma; {
; Location: SIMD.jl:1010
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function macro expansion; {
; Location: SIMD.jl:688
	vfmadd231pd	%ymm15, %ymm13, %ymm10
;}}}}
; Location: In[4]:18
; Function fma; {
; Location: SIMD.jl:1010
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function macro expansion; {
; Location: SIMD.jl:688
	vfmadd231pd	%ymm14, %ymm11, %ymm6
	vfmadd231pd	%ymm14, %ymm13, %ymm5
;}}}}
; Location: In[4]:19
; Function Type; {
; Location: SIMD.jl:114
; Function macro expansion; {
; Location: SIMD.jl:116
	vbroadcastsd	384(%rdx,%rsi), %ymm14
;}}
; Function fma; {
; Location: SIMD.jl:1010
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function macro expansion; {
; Location: SIMD.jl:688
	vfmadd231pd	%ymm14, %ymm11, %ymm0
	vfmadd231pd	%ymm14, %ymm13, %ymm7
;}}}}
; Location: In[4]:20
; Function Type; {
; Location: SIMD.jl:114
; Function macro expansion; {
; Location: SIMD.jl:116
	vbroadcastsd	512(%rdx,%rsi), %ymm14
;}}
; Function fma; {
; Location: SIMD.jl:1010
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function macro expansion; {
; Location: SIMD.jl:688
	vfmadd231pd	%ymm14, %ymm11, %ymm1
	vfmadd231pd	%ymm14, %ymm13, %ymm8
;}}}}
; Location: In[4]:21
; Function Type; {
; Location: SIMD.jl:114
; Function macro expansion; {
; Location: SIMD.jl:116
	vbroadcastsd	640(%rdx,%rsi), %ymm14
;}}
; Location: In[4]:22
; Function iterate; {
; Location: range.jl:575
; Function ==; {
; Location: promotion.jl:425
	addq	$8, %rsi
;}}
; Location: In[4]:21
; Function fma; {
; Location: SIMD.jl:1010
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function macro expansion; {
; Location: SIMD.jl:688
	vfmadd231pd	%ymm14, %ymm11, %ymm2
	vfmadd231pd	%ymm14, %ymm13, %ymm9
;}}}}
; Location: In[4]:22
; Function fma; {
; Location: SIMD.jl:1010
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function llvmwrap; {
; Location: SIMD.jl:666
; Function macro expansion; {
; Location: SIMD.jl:688
	vfmadd213pd	%ymm12, %ymm15, %ymm11
;}}}}
	jne	L112
; Location: In[4]:24
; Function vstore; {
; Location: SIMD.jl:1337
; Function vstore; {
; Location: SIMD.jl:1337
; Function macro expansion; {
; Location: SIMD.jl:1361
	vmovupd	%ymm3, (%rax)
	vmovupd	%ymm4, 32(%rax)
;}}}
; Location: In[4]:25
; Function vstore; {
; Location: SIMD.jl:1337
; Function vstore; {
; Location: SIMD.jl:1337
; Function macro expansion; {
; Location: SIMD.jl:1361
	vmovupd	%ymm5, 64(%rax)
	vmovupd	%ymm6, 96(%rax)
;}}}
; Location: In[4]:26
; Function vstore; {
; Location: SIMD.jl:1337
; Function vstore; {
; Location: SIMD.jl:1337
; Function macro expansion; {
; Location: SIMD.jl:1361
	vmovupd	%ymm7, 128(%rax)
	vmovupd	%ymm0, 160(%rax)
;}}}
; Location: In[4]:27
; Function vstore; {
; Location: SIMD.jl:1337
; Function vstore; {
; Location: SIMD.jl:1337
; Function macro expansion; {
; Location: SIMD.jl:1361
	vmovupd	%ymm8, 192(%rax)
	vmovupd	%ymm1, 224(%rax)
;}}}
; Location: In[4]:28
; Function vstore; {
; Location: SIMD.jl:1337
; Function vstore; {
; Location: SIMD.jl:1337
; Function macro expansion; {
; Location: SIMD.jl:1361
	vmovupd	%ymm9, 256(%rax)
	vmovupd	%ymm2, 288(%rax)
;}}}
; Location: In[4]:29
; Function vstore; {
; Location: SIMD.jl:1337
; Function vstore; {
; Location: SIMD.jl:1337
; Function macro expansion; {
; Location: SIMD.jl:1361
	vmovupd	%ymm10, 320(%rax)
	vmovupd	%ymm11, 352(%rax)
;}}}
; Location: In[4]:30
	vzeroupper
	retq
;}

At the label L112, we see vmovapd %ymm11, %ymm12, which is probably unnecessary (it moves the contents of register 11 to register 12), and it gets executed once per iteration of the for loop.
It seems to happen because, after the last broadcast, there is a vfmadd instruction that writes onto register 11 instead of register 12, where it is actually trying to accumulate from D:
vfmadd213pd %ymm12, %ymm15, %ymm11

Interestingly, I did experiment with lightly editing the above assembly and compiling it with gcc, and it was about as fast as the C code.
Of course, we’re less interested in the kernels themselves than we are multiplying larger dense matrices. Let us compare to OpenBLAS:

In [17]:
@show BLAS.vendor()
BLAS.set_num_threads(1)
const D = Matrix{Float64}(undef,800,600)
const A, X = randn(800,1600), randn(1600,600)
@benchmark mul!(D, A, X)
BLAS.vendor() = :openblas64
Out[17]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     53.442 ms (0.00% GC)
  median time:      54.065 ms (0.00% GC)
  mean time:        54.643 ms (0.00% GC)
  maximum time:     57.838 ms (0.00% GC)
  --------------
  samples:          92
  evals/sample:     1

Our kernel is about the same speed: assuming $O(n^3)$ scaling, the larger multiplication involves $(10^2)^3=10^6$ times as much work, so scaling the kernel’s roughly 55 ns up by $10^6$ (swapping ns for ms) gives roughly 55 ms, which is about what we see from OpenBLAS.
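Spelled out (again, just arithmetic on the numbers already reported):

(800 * 1600 * 600) / (8 * 16 * 6) # 1.0e6 kernel-sized blocks of work
55e-9 * 10^6                      # ≈ 0.055 s, i.e. ~55 ms, in line with OpenBLAS's ~53 ms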
For what it’s worth, calling the kernel a million times here mostly seems to improve consistency:

In [18]:
@benchmark foreach(i -> kernel!(D8_6, A8_16, X16_6), 1:10^6)
Out[18]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     55.076 ms (0.00% GC)
  median time:      55.946 ms (0.00% GC)
  mean time:        56.029 ms (0.00% GC)
  maximum time:     60.084 ms (0.00% GC)
  --------------
  samples:          90
  evals/sample:     1

but when we get to it, we are going to see a bigger improvement from increasing N, which decreases the relative frequency of reads and writes of D. Making sure we aren’t bottlenecked by memory will be the bigger problem.

Now, the advantage of writing in Julia is that it lets us write generic optimized code. Let’s break this down into two problems: functions to choose the kernel size, and making the pattern agnostic to the size of the inputs.
We’ll give a simple example of the latter with MMatrices here, with one that accounts for strides across columns when we put things together later.

Here, I am $-$ like in C $-$ defining the SIMD vectors to be exactly the register width, to save the compiler from the work of splitting them. When I didn’t do this initially, some of the kernels took over a second to compile, and a few even failed to vectorize correctly.

In [19]:
using Base.Cartesian

@generated function generic_kernel!(D::MMatrix{M,P,T},
                        A::MMatrix{M,N,T},
                        X::MMatrix{N,P,T}) where {M,P,N,T}
    MSoT = M*sizeof(T)
    L = REGISTER_SIZE ÷ sizeof(T)
    Q, r = divrem(M, L) #Assuming M is a multiple of L
    r == 0 || throw("Number of rows $M not a multiple of register size: $REGISTER_SIZE.")
    V = Vec{L,T}
    quote
        pD, pA = pointer(D), pointer(A)

        @nexprs $P p -> @nexprs $Q q -> Dx_p_q = vload($V, pD + $REGISTER_SIZE*(q-1) + $MSoT*(p-1))
        @inbounds for n ∈ 1:N
            @nexprs $P p -> begin
                vX = $V(X[n,p])
                @nexprs $Q q -> begin
                    vA_q = vload($V, pA + (n-1)*$MSoT + $REGISTER_SIZE*(q-1))
                    Dx_p_q = fma(vA_q, vX, Dx_p_q)
                end
            end
        end
        @nexprs $P p -> @nexprs $Q q -> vstore(Dx_p_q, pD + $REGISTER_SIZE*(q-1) + $MSoT*(p-1))
        D
    end
end
Out[19]:
generic_kernel! (generic function with 1 method)
In [20]:
D8_6 .= 0; generic_kernel!(D8_6, A8_16, X16_6)
Out[20]:
8×6 MArray{Tuple{8,6},Float64,2,48}:
 -0.120053  -0.120053  -0.120053  -0.120053  -0.120053  -0.120053
 -4.94158   -4.94158   -4.94158   -4.94158   -4.94158   -4.94158
 -8.56774   -8.56774   -8.56774   -8.56774   -8.56774   -8.56774
  1.0102     1.0102     1.0102     1.0102     1.0102     1.0102
 -4.57225   -4.57225   -4.57225   -4.57225   -4.57225   -4.57225
 -0.627791  -0.627791  -0.627791  -0.627791  -0.627791  -0.627791
  2.05503    2.05503    2.05503    2.05503    2.05503    2.05503
 -2.57792   -2.57792   -2.57792   -2.57792   -2.57792   -2.57792 
In [21]:
@benchmark generic_kernel!(D8_6, A8_16, X16_6)
Out[21]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     53.064 ns (0.00% GC)
  median time:      53.970 ns (0.00% GC)
  mean time:        54.741 ns (0.00% GC)
  maximum time:     80.476 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     985

If you’re unfamiliar with generated functions, try removing the @generated macro and calling the function; you’ll see a quoted Julia expression.
In a generated function, you can use the type information of the arguments to build a Julia expression, which the function then returns. The first time a generated function is called for a given combination of input types, its body runs, and the quoted expression it returns is what gets compiled.
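As a toy example of the mechanism (my own, unrelated to the kernel code):

# The body of a generated function sees argument *types* (here it uses N) and
# returns an expression, which is then compiled as the method body for those types.
@generated function halve_all(x::NTuple{N,Float64}) where N
    ex = Expr(:tuple)
    for i in 1:N
        push!(ex.args, :(x[$i] / 2))
    end
    return ex
end

halve_all((2.0, 4.0, 6.0)) # (1.0, 2.0, 3.0)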

You can also use @macroexpand to explore the behavior of Base.Cartesian's @nexprs. For example, try

@macroexpand @nexprs 3 i -> @nexprs 2 j -> d_i_j = (i,j)

This is a convenient way of writing unrolled expressions.
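Up to quoting and hygiene details, that expansion is equivalent to writing out:

d_1_1 = (1, 1); d_1_2 = (1, 2)
d_2_1 = (2, 1); d_2_2 = (2, 2)
d_3_1 = (3, 1); d_3_2 = (3, 2)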

In [22]:
# const REGISTER_SIZE = CpuId.simdbytes() # defined earlier
const REGISTER_COUNT = CpuId.cpufeature(CpuId.AVX512F) ? 32 : 16
const CACHELINE_SIZE = CpuId.cachelinesize()
function kernel_size_summary(::Type{T} = Float64,
            register_size = REGISTER_SIZE, register_count = REGISTER_COUNT) where T
    t_size = sizeof(T)
    num_per_register = register_size ÷ t_size
    max_total = num_per_register * register_count
    cache_line = CACHELINE_SIZE ÷ t_size
    num_cache_lines = cld(max_total, cache_line)
    summary = Matrix{Float64}(undef, 5, num_cache_lines)
    for num_row_cachelines ∈ 1:num_cache_lines
        num_rows = num_row_cachelines * cache_line
        a_loads = cld(num_rows, num_per_register)
        num_cols = (register_count - a_loads - 2) ÷ a_loads
        if num_cols == 0
            println("A * X = D; nrow(D), ncol(D), length(D), NumRegistersLoaded")
            return summary[:, 1:num_row_cachelines-1]
        end
        summary[:, num_row_cachelines] .= (num_rows, num_cols, num_rows * num_cols, num_cols + a_loads, num_rows * num_cols / (num_cols + a_loads))
    end
    println("A * X = D; nrow(D), ncol(D), length(D), NumRegistersLoaded")
    summary
end
Out[22]:
kernel_size_summary (generic function with 4 methods)
In [23]:
kernel_size_summary(Float64)
A * X = D; nrow(D), ncol(D), length(D), NumRegistersLoaded
Out[23]:
5×3 Array{Float64,2}:
  8.0  16.0      24.0
  6.0   2.0       1.0
 48.0  32.0      24.0
  8.0   6.0       7.0
  6.0   5.33333   3.42857
In [24]:
kernel_size_summary(Float32)
A * X = D; nrow(D), ncol(D), length(D), NumRegistersLoaded
Out[24]:
5×3 Array{Float64,2}:
 16.0  32.0     48.0
  6.0   2.0      1.0
 96.0  64.0     48.0
  8.0   6.0      7.0
 12.0  10.6667   6.85714
In [25]:
kernel_size_summary(Complex{Float64})
A * X = D; nrow(D), ncol(D), length(D), NumRegistersLoaded
Out[25]:
5×3 Array{Float64,2}:
  4.0   8.0      12.0
  6.0   2.0       1.0
 24.0  16.0      12.0
  8.0   6.0       7.0
  3.0   2.66667   1.71429

This suggests 6 columns is generally the way to go.
In practice, we can have the function simply return the combination that produces the highest value in the bottom row.
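A minimal sketch of that selection rule, built on the kernel_size_summary function above (the helper name is my own):

# Return the (rows, cols) shape whose column of the summary has the largest
# value in the bottom row, i.e. the most arithmetic per register load.
function best_kernel_shape(::Type{T} = Float64) where T
    summary = kernel_size_summary(T)
    best = argmax(summary[end, :])
    Int(summary[1, best]), Int(summary[2, best])
end

best_kernel_shape(Float64) # (8, 6) on this (AVX2) machine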

We’ll focus on performance of structs of multiple numbers like Complex{Float64} and DualNumbers later.
I haven’t looked at these yet, but I imagine StructsOfArrays.jl may be the way to go for performance. Switching to a computer with AVX512:

In [1]:
versioninfo()
Julia Version 0.7.0-rc3.3
Commit 0600872e7f (2018-08-07 14:34 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
In [2]:
using CpuId
simdbytes(), cpufeature(CpuId.AVX512F), cpufeature(CpuId.FP256)
Out[2]:
(64, true, true)
In [3]:
const REGISTER_SIZE = CpuId.simdbytes()
const REGISTER_COUNT = CpuId.cpufeature(CpuId.AVX512F) ? 32 : 16
const CACHELINE_SIZE = CpuId.cachelinesize()
function kernel_size_summary(::Type{T} = Float64,
            register_size = REGISTER_SIZE, register_count = REGISTER_COUNT) where T
    t_size = sizeof(T)
    num_per_register = register_size ÷ t_size
    max_total = num_per_register * register_count
    cache_line = CACHELINE_SIZE ÷ t_size
    num_cache_lines = cld(max_total, cache_line)
    summary = Matrix{Float64}(undef, 5, num_cache_lines)
    for num_row_cachelines ∈ 1:num_cache_lines
        num_rows = num_row_cachelines * cache_line
        a_loads = cld(num_rows, num_per_register)
        num_cols = (register_count - a_loads - 2) ÷ a_loads
        if num_cols == 0
            println("A * X = D; nrow(D), ncol(D), length(D), NumRegistersLoaded")
            return summary[:, 1:num_row_cachelines-1]
        end
        summary[:, num_row_cachelines] .= (num_rows, num_cols, num_rows * num_cols, num_cols + a_loads, num_rows * num_cols / (num_cols + a_loads))
    end
    println("A * X = D; nrow(D), ncol(D), length(D), NumRegistersLoaded")
    summary
end
kernel_size_summary(Float64)
A * X = D; nrow(D), ncol(D), length(D), NumRegistersLoaded
Out[3]:
5×15 Array{Float64,2}:
   8.0       16.0   24.0   32.0   40.0   48.0  …  104.0      112.0      120.0
  29.0       14.0    9.0    6.0    5.0    4.0       1.0        1.0        1.0
 232.0      224.0  216.0  192.0  200.0  192.0     104.0      112.0      120.0
  30.0       16.0   12.0   10.0   10.0   10.0      14.0       15.0       16.0
   7.73333   14.0   18.0   19.2   20.0   19.2       7.42857    7.46667    7.5
In [4]:
kernel_size_summary(Float32)
A * X = D; nrow(D), ncol(D), length(D), NumRegistersLoaded
Out[4]:
5×15 Array{Float64,2}:
  16.0      32.0   48.0   64.0   80.0  …  192.0     208.0     224.0     240.0
  29.0      14.0    9.0    6.0    5.0       1.0       1.0       1.0       1.0
 464.0     448.0  432.0  384.0  400.0     192.0     208.0     224.0     240.0
  30.0      16.0   12.0   10.0   10.0      13.0      14.0      15.0      16.0
  15.4667   28.0   36.0   38.4   40.0      14.7692   14.8571   14.9333   15.0
In [5]:
kernel_size_summary(Complex{Float64})
A * X = D; nrow(D), ncol(D), length(D), NumRegistersLoaded
Out[5]:
5×15 Array{Float64,2}:
   4.0        8.0   12.0  16.0   20.0  …  48.0      52.0      56.0      60.0
  29.0       14.0    9.0   6.0    5.0      1.0       1.0       1.0       1.0
 116.0      112.0  108.0  96.0  100.0     48.0      52.0      56.0      60.0
  30.0       16.0   12.0  10.0   10.0     13.0      14.0      15.0      16.0
   3.86667    7.0    9.0   9.6   10.0      3.69231   3.71429   3.73333   3.75
In [6]:
using Base.Cartesian, SIMD, StaticArrays, BenchmarkTools

@generated function generic_kernel!(D::MMatrix{M,P,T},
                        A::MMatrix{M,N,T},
                        X::MMatrix{N,P,T}) where {M,P,N,T}
    MSoT = M*sizeof(T)
    L = REGISTER_SIZE ÷ sizeof(T)
    Q, r = divrem(M, L) #Assuming M is a multiple of L
    r == 0 || throw("Number of rows $M not a multiple of register size: $REGISTER_SIZE.")
    V = Vec{L,T}
    quote
        pD, pA = pointer(D), pointer(A)

        @nexprs $P p -> @nexprs $Q q -> Dx_p_q = vload($V, pD + $REGISTER_SIZE*(q-1) + $MSoT*(p-1))
        @inbounds for n ∈ 1:N
            @nexprs $P p -> begin
                vX = $V(X[n,p])
                @nexprs $Q q -> begin
                    vA_q = vload($V, pA + (n-1)*$MSoT + $REGISTER_SIZE*(q-1))
                    Dx_p_q = fma(vA_q, vX, Dx_p_q)
                end
            end
        end
        @nexprs $P p -> @nexprs $Q q -> vstore(Dx_p_q, pD + $REGISTER_SIZE*(q-1) + $MSoT*(p-1))
        D
    end
end

const D16_14 = @MMatrix zeros(16,14)
const A16_25 = @MMatrix randn(16,25)
const X25_14 = @MMatrix randn(25,14)

const D40_5 = @MMatrix zeros(40,5)
const A40_28 = @MMatrix randn(40,28)
const X28_5 = @MMatrix randn(28,5)

@benchmark generic_kernel!(D16_14, A16_25, X25_14)
Out[6]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     106.204 ns (0.00% GC)
  median time:      106.486 ns (0.00% GC)
  mean time:        108.114 ns (0.00% GC)
  maximum time:     158.706 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     934
In [7]:
@benchmark generic_kernel!(D40_5, A40_28, X28_5)
Out[7]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     105.034 ns (0.00% GC)
  median time:      105.280 ns (0.00% GC)
  mean time:        107.620 ns (0.00% GC)
  maximum time:     160.043 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     935
In [8]:
const sD32_14 = @MMatrix zeros(Float32, 32,14)
const sA32_25 = @MMatrix randn(Float32, 32,25)
const sX25_14 = @MMatrix randn(Float32, 25,14)

const sD80_5 = @MMatrix zeros(Float32, 80,5)
const sA80_28 = @MMatrix randn(Float32, 80,28)
const sX28_5 = @MMatrix randn(Float32, 28,5)
@benchmark generic_kernel!(sD32_14, sA32_25, sX25_14)
Out[8]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     104.927 ns (0.00% GC)
  median time:      105.109 ns (0.00% GC)
  mean time:        110.124 ns (0.00% GC)
  maximum time:     154.205 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     927
In [9]:
@benchmark generic_kernel!(sD80_5, sA80_28, sX28_5)
Out[9]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     102.860 ns (0.00% GC)
  median time:      103.213 ns (0.00% GC)
  mean time:        107.577 ns (0.00% GC)
  maximum time:     183.533 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     940

This difference looks very small. Further testing is probably necessary to see if the $P=5$ kernel is actually optimal.

Also worth noting is the awesome performance of AVX-512 here. Scaling up the minimum time for the generic kernel run on Ryzen:

In [10]:
53.064 * (16 / 8) * (25 / 16) * (14 / 6)
Out[10]:
386.925

That is an estimated runtime of roughly 390 ns, versus the little more than 100 ns we observed.

In [11]:
const Dbig = randn(1600, 1400)
const Abig = randn(1600, 2500)
const Xbig = randn(2500, 1400)
using LinearAlgebra
@show BLAS.vendor()
BLAS.set_num_threads(1)
@benchmark mul!(Dbig, Abig, Xbig)
BLAS.vendor() = :openblas64
Out[11]:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     213.896 ms (0.00% GC)
  median time:      216.096 ms (0.00% GC)
  mean time:        216.361 ms (0.00% GC)
  maximum time:     220.369 ms (0.00% GC)
  --------------
  samples:          24
  evals/sample:     1

Here, the kernel trounces OpenBLAS $-$ scaling the roughly 106 ns kernel by the $10^6$ ratio in work gives about 106 ms, versus OpenBLAS’s 214 ms $-$ but thanks to the massive throughput, fighting memory bandwidth is going to be an uphill battle.