I have learned a lot about lower level programming! I may return to a couple of older posts to elaborate or add corrections.
But for now, let’s explore something new.
As a challenge, I wanted to touch on matrix multiplication written in Julia.
There are two chief advantages of this:
- The ability to get both generic and optimal code.
- The ability to get optimal performance at small and hopefully large sizes.
Libraries like Intel MKL and OpenBLAS (the default in Julia) are slow at small sizes. While MKL is proprietary, we can look at the source code of OpenBLAS and see that it has assembly kernels written for many different architectures.
Each time a new architecture comes out, new kernels must be added or old ones adapted before it is supported. That is hard to maintain.
If we could instead rely on compilers, which also get updated for each new architecture, it would make our lives a lot easier.
Of course, compilers can’t do everything, so we can use CpuId.jl
to provide information about the CPU we’re running on.
I will be assuming you’re using an x86_64 processor (almost all laptop and desktop processors are), but it would be good to support ARM or POWER too. I’m not familiar with these, however.
In this post, I will focus on optimizing a “compute kernel”. The compute kernel is a small core calculation, an atom that we build the larger operations out of.
Say we have $\textbf{D} = \textbf{A} \textbf{X}$, where $\textbf{A}$ is $M\times N$, $\textbf{X}$ is $N\times P$, and $\textbf{D}$ is $M\times P$. We can divide these into blocks such that:
$\begin{bmatrix}\textbf{D}_{1,1} & \textbf{D}_{1,2}\\\textbf{D}_{2,1} & \textbf{D}_{2,2}\end{bmatrix} = \begin{bmatrix}\textbf{A}_{1,1} & \textbf{A}_{1,2}\\\textbf{A}_{2,1} & \textbf{A}_{2,2}\end{bmatrix} \begin{bmatrix}\textbf{X}_{1,1} & \textbf{X}_{1,2}\\\textbf{X}_{2,1} & \textbf{X}_{2,2}\end{bmatrix}\\ = \begin{bmatrix} \textbf{A}_{1,1}\textbf{X}_{1,1} + \textbf{A}_{1,2}\textbf{X}_{2,1}& \textbf{A}_{1,1}\textbf{X}_{1,2} + \textbf{A}_{1,2}\textbf{X}_{2,2}\\ \textbf{A}_{2,1}\textbf{X}_{1,1} + \textbf{A}_{2,2}\textbf{X}_{2,1}& \textbf{A}_{2,1}\textbf{X}_{1,2} + \textbf{A}_{2,2}\textbf{X}_{2,2}\end{bmatrix}$
so in this way, we can break a larger computation down into blocks matching our kernel(s).
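As a quick sanity check of that identity, here is a small sketch in plain Julia (the block sizes are arbitrary, chosen just for illustration):
A = randn(8, 6); X = randn(6, 4)
A11, A12 = A[1:4, 1:3], A[1:4, 4:6]
A21, A22 = A[5:8, 1:3], A[5:8, 4:6]
X11, X12 = X[1:3, 1:2], X[1:3, 3:4]
X21, X22 = X[4:6, 1:2], X[4:6, 3:4]
D = A * X
D[1:4, 1:2] ≈ A11*X11 + A12*X21 # true
D[5:8, 3:4] ≈ A21*X12 + A22*X22 # true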
Combining these kernels into larger operations will be the subject of another post — dealing with memory bottlenecks for large matrices is a challenge.
For now, let us focus just on the kernel itself. That is also the part most relevant to fast operations of small matrices.
To begin, let’s consider a single core of an x86_64 processor:
- They have 16 (most) or 32 (AVX-512) registers.
- Each register holds 16 (old), 32 (most), or 64 (AVX-512) bytes.
- One instruction can load a chunk of data into a register, or broadcast a single value across it. Gather operations, which load bits of data that aren’t side by side, are comparatively very slow.
- One instruction can perform a multiply, an add, or a fused multiply-add (FMA) on the entire contents of a register (a small sketch follows this list).
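To make those register operations concrete, here is a tiny sketch using SIMD.jl (which we’ll use properly below); the lane values are arbitrary:
using SIMD
a = Vec{4,Float64}((1.0, 2.0, 3.0, 4.0)) # one 32 byte register worth of Float64
b = Vec{4,Float64}(2.5)                  # broadcast a single value across all lanes
c = Vec{4,Float64}(1.0)
fma(a, b, c)                             # one fused multiply-add over the whole register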
Throughput for operations is complicated, but as a rule of thumb: the wider the vectors, the better. Commonly cited caveats (which GCC follows by default, needing -mprefer-vector-width=NUMBER_OF_BITS
to override) are:
- Ryzen’s throughput is equal for 32 byte and 16 byte fma instructions, because the cores have 16 byte units that work together to act like a 32 byte unit. Intel processors have full sized units.
- Intel AVX-512 processors run at a slower clock rate when doing a lot of AVX-512 operations, because they generate a lot of heat.
However, in both cases, the larger vectors are still dramatically faster in numerical code that is easy to vectorize. Here, easy means “doesn’t need much code to arrange and line up data in the correct way”, and most linear algebra operations don’t. Matrix multiplication in particular benefits, needing far fewer memory operations the wider the vectors are (and the more registers there are).
To check some of these features of your CPU, you can use CpuId.jl
as follows:
using CpuId
REGISTER_SIZE = simdbytes()
REGISTER_SIZE, cpufeature(CpuId.AVX512F), cpufeature(CpuId.FP256)
This means that on the computer I am currently writing this with, I have 32 byte (256 bit) SIMD instructions, but do not have the AVX-512 foundation, and 256 bit floating point operations are not executed natively at full width — although they can still be used. That is because I am using a Ryzen processor:
versioninfo()
simdbytes() ÷ sizeof(Float64) # How many Float64 fit in a single register
simdbytes
refers to the SIMD (single instruction, multiple data) vector width, while cpufeature(CpuId.AVX512F)
checks if we have the AVX-512 foundation extension, which doubles the number of registers from 16 to 32 (alongside other benefits). We don’t.
Now, moving onto how we can organize our operations, recall that the element $D_{m,p} = \sum_{n=1}^N A_{m,n} * X_{n,p}$.
That is, for an element of the answer, we move across the columns of $\textbf{A}$ and down the rows of $\textbf{X}$.
I will now be assuming a column major order, the standard in languages like Julia, R, MATLAB, and Fortran. Languages like C, C++, Rust, and Python are instead row major, but the same strategy would apply for $\textbf{D}^* = \textbf{X}^*\textbf{A}^*$, where $\textbf{D}^*$ is $P\times M$, $\textbf{X}^*$ is $P\times N$, and $\textbf{A}^*$ is $N\times M$.
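A quick check of that identity in plain Julia (sizes arbitrary):
A = randn(3, 4); X = randn(4, 5)
X' * A' ≈ (A * X)' # true: D' = X'A'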
Given the column major order, the elements in a column are contiguous, and can be efficiently loaded into or stored from a vector register together, while the elements in a row are not.
Our objective is to minimize how much we’re doing anything other than arithmetic operations. Part of that means loading and storing contents from registers as infrequently as possible, and every time we do load data from $\textbf{A}$ or $\textbf{X}$, we want to do as much as we can with it before loading something else in its place.
Let’s define the shape of a kernel as an $m\times p$ block, corresponding for example to $\textbf{A}_{1,1}\textbf{X}_{1,1}$ above $-$ a sub-block we’re calculating on our way to getting the full answer.
Because the columns are contiguous, we want these columns to be made up of vectors stored in the registers. Naturally, $m$ should be some multiple of the number of elements a register holds for the given data type. For reasons discussed later, it should ideally also be a multiple of the cache line size, given by cachelinesize()
in CpuId; that is $64$ bytes on this computer, hence we want $m$ to be a multiple of 8.
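For example, checking on this machine (the numbers are specific to it):
using CpuId
cachelinesize() ÷ sizeof(Float64) # 64 bytes ÷ 8 bytes/Float64 = 8, so m should be a multiple of 8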
Doing a little algebra, we see that:
$
D_{1,1} = A_{1,1}X_{1,1} + A_{1,2}X_{2,1} + A_{1,3}X_{3,1} +\dots\\
D_{2,1} = A_{2,1}X_{1,1} + A_{2,2}X_{2,1} + A_{2,3}X_{3,1} +\dots\\
D_{3,1} = A_{3,1}X_{1,1} + A_{3,2}X_{2,1} + A_{3,3}X_{3,1} +\dots\\
D_{4,1} = A_{4,1}X_{1,1} + A_{4,2}X_{2,1} + A_{4,3}X_{3,1} +\dots\\
D_{5,1} = A_{5,1}X_{1,1} + A_{5,2}X_{2,1} + A_{5,3}X_{3,1} +\dots\\
D_{6,1} = A_{6,1}X_{1,1} + A_{6,2}X_{2,1} + A_{6,3}X_{3,1} +\dots\\
D_{7,1} = A_{7,1}X_{1,1} + A_{7,2}X_{2,1} + A_{7,3}X_{3,1} +\dots\\
D_{8,1} = A_{8,1}X_{1,1} + A_{8,2}X_{2,1} + A_{8,3}X_{3,1} +\dots\\
$
Therefore, if we want to calculate the first column of $\textbf{D}$, we need to load vectors from the columns of $\textbf{A}$ and broadcast elements from $\textbf{X}$. We can reuse each broadcast from $\textbf{X}$ across every vector loaded from a column of $\textbf{A}$. That is, each broadcast gives us $\lceil\frac{m}{\text{vector length}}\rceil$ operations.
Similarly, the second column of $\textbf{D}$ looks like:
$
D_{1,2} = A_{1,1}X_{1,2} + A_{1,2}X_{2,2} + A_{1,3}X_{3,2} +\dots\\
D_{2,2} = A_{2,1}X_{1,2} + A_{2,2}X_{2,2} + A_{2,3}X_{3,2} +\dots\\
D_{3,2} = A_{3,1}X_{1,2} + A_{3,2}X_{2,2} + A_{3,3}X_{3,2} +\dots\\
D_{4,2} = A_{4,1}X_{1,2} + A_{4,2}X_{2,2} + A_{4,3}X_{3,2} +\dots\\
D_{5,2} = A_{5,1}X_{1,2} + A_{5,2}X_{2,2} + A_{5,3}X_{3,2} +\dots\\
D_{6,2} = A_{6,1}X_{1,2} + A_{6,2}X_{2,2} + A_{6,3}X_{3,2} +\dots\\
D_{7,2} = A_{7,1}X_{1,2} + A_{7,2}X_{2,2} + A_{7,3}X_{3,2} +\dots\\
D_{8,2} = A_{8,1}X_{1,2} + A_{8,2}X_{2,2} + A_{8,3}X_{3,2} +\dots\\
$
Now, we see that we can reuse loads from $\textbf{A}$ across columns of $\textbf{D}$.
For each load from $\textbf{A}$, we get one operation per column of the kernel, $p$ in total.
This gives us a constrained optimization problem. We are constrained by the number of CPU registers, while trying to maximize the number of operations per load.
Choosing $m=8,p=6$, for each 2 loads from $\textbf{A}$ and 6 broadcasts from $\textbf{X}$, we get 12 multiplications, or 1.5 multiplications per memory operation.
This also requires $2*6=12$ registers of elements from $\textbf{D}$. As we move from $n=1$ to $n=N$, we also need 2 from $\textbf{A}$, and ideally a couple broadcasts from $\textbf{X}$. This totals to exactly 16 registers.
Choosing $m=16,p=2$ gives us 4 loads and 2 broadcasts per 8 arithmetic operations, or $\frac{8}{6}\approx 1.33$. This takes up $4*2+4+2 = 14$ registers.
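To make that bookkeeping concrete, here is a small sketch (a helper of my own, assuming 4 Float64 per register as above) computing fused multiply-adds per memory operation for a candidate kernel shape:
ops_per_memop(m, p, L = 4) = (cld(m, L) * p) / (cld(m, L) + p)
ops_per_memop(8, 6)  # 12 fmas / (2 loads + 6 broadcasts) = 1.5
ops_per_memop(16, 2) # 8 fmas / (4 loads + 2 broadcasts) ≈ 1.33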
To implement a kernel, we could rely on LLVM to autovectorize.
In my experience, it does a good job of using vector instructions correctly, but it tends to emit extra vmovapd
instructions that move vectors from one register to another unnecessarily. It does not seem to do this when we’re being explicit, and because we know exactly what we want to happen, we may as well not rely on the autovectorizer. SIMD.jl
allows us to use LLVM vector intrinsics.
For now, we’ll also simply implement them for the StaticArrays.jl
‘s MMatrix
type, a mutable matrix of static size (known at compile time). Unfortunately, SIMD.jl
only supports loading from the basic Array
type and pointers (or broadcasting scalars), forcing us to use pointers here.
using SIMD, StaticArrays, LinearAlgebra
function kernel!(D::MMatrix{8,6,Float64},
A::MMatrix{8,N,Float64},
X::MMatrix{N,6,Float64}) where N
pD, pA = pointer(D), pointer(A)
V = Vec{8,Float64}
Dx_1 = vload(V, pD + 64*0) # 8 bytes/element * 8 elements/column
Dx_2 = vload(V, pD + 64*1)
Dx_3 = vload(V, pD + 64*2)
Dx_4 = vload(V, pD + 64*3)
Dx_5 = vload(V, pD + 64*4)
Dx_6 = vload(V, pD + 64*5)
@inbounds for n ∈ 1:N # Here, we increase N.
vA = vload(V, pA + (n-1)*64)
Dx_1 = fma(vA, V(X[n,1]), Dx_1)
Dx_2 = fma(vA, V(X[n,2]), Dx_2)
Dx_3 = fma(vA, V(X[n,3]), Dx_3)
Dx_4 = fma(vA, V(X[n,4]), Dx_4)
Dx_5 = fma(vA, V(X[n,5]), Dx_5)
Dx_6 = fma(vA, V(X[n,6]), Dx_6)
end
vstore(Dx_1, pD + 64*0)
vstore(Dx_2, pD + 64*1)
vstore(Dx_3, pD + 64*2)
vstore(Dx_4, pD + 64*3)
vstore(Dx_5, pD + 64*4)
vstore(Dx_6, pD + 64*5)
D
end
const D8_6 = @MMatrix zeros(8,6)
const A8_16 = @MMatrix randn(8,16)
const X16_6 = @MMatrix randn(16,6)
kernel!(D8_6, A8_16, X16_6)
mul!(D8_6, A8_16, X16_6) # Compare answer with StaticArrays.jl's mul.
Note that the LinearAlgebra and StaticArrays mul!
function calculates $\textbf{D} = \textbf{AX}$, while our kernel calculates $\textbf{D} = \textbf{D} + \textbf{AX}$, because the blockwise summation looks mostly like the latter.
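Since D8_6 started out as zeros, the two calls above should agree. More generally, to check the kernel against a plain product, clear D first (a quick sketch):
D8_6 .= 0                      # clear the accumulator so D + A*X == A*X
kernel!(D8_6, A8_16, X16_6)
D8_6 ≈ A8_16 * X16_6           # should be true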
Nonetheless, we can time it compared to the StaticArrays function:
using BenchmarkTools
@benchmark kernel!(D8_6, A8_16, X16_6)
@benchmark mul!(D8_6, A8_16, X16_6)
and see quite an improvement.
Let us also take a brief aside to consider C and Fortran implementations.
This C code is comparable to the Julia code, relying on vector intrinsics:
#include<immintrin.h>
void kernel8x16x6(double* D, double* A, double* X, long N){
__m256d D11, D12, D21, D22, D31, D32, D41, D42, D51, D52, D61, D62;
__m256d A1, A2, X1, X2, X3, X4, X5, X6;
D11 = _mm256_loadu_pd(D);
D12 = _mm256_loadu_pd(D + 4);
D21 = _mm256_loadu_pd(D + 8);
D22 = _mm256_loadu_pd(D + 12);
D31 = _mm256_loadu_pd(D + 16);
D32 = _mm256_loadu_pd(D + 20);
D41 = _mm256_loadu_pd(D + 24);
D42 = _mm256_loadu_pd(D + 28);
D51 = _mm256_loadu_pd(D + 32);
D52 = _mm256_loadu_pd(D + 36);
D61 = _mm256_loadu_pd(D + 40);
D62 = _mm256_loadu_pd(D + 44);
for(long i = 0; i < N; i++){
A1 = _mm256_loadu_pd(A + 8*i);
A2 = _mm256_loadu_pd(A + 8*i + 4);
X1 = _mm256_broadcast_sd(X+i);
D11 = _mm256_fmadd_pd(A1, X1, D11);
D12 = _mm256_fmadd_pd(A2, X1, D12);
X2 = _mm256_broadcast_sd(X+(N+i));
D21 = _mm256_fmadd_pd(A1, X2, D21);
D22 = _mm256_fmadd_pd(A2, X2, D22);
X3 = _mm256_broadcast_sd(X+(2*N+i));
D31 = _mm256_fmadd_pd(A1, X3, D31);
D32 = _mm256_fmadd_pd(A2, X3, D32);
X4 = _mm256_broadcast_sd(X+(3*N+i));
D41 = _mm256_fmadd_pd(A1, X4, D41);
D42 = _mm256_fmadd_pd(A2, X4, D42);
X5 = _mm256_broadcast_sd(X+(4*N+i));
D51 = _mm256_fmadd_pd(A1, X5, D51);
D52 = _mm256_fmadd_pd(A2, X5, D52);
X6 = _mm256_broadcast_sd(X+(5*N+i));
D61 = _mm256_fmadd_pd(A1, X6, D61);
D62 = _mm256_fmadd_pd(A2, X6, D62);
}
_mm256_storeu_pd(D, D11);
_mm256_storeu_pd(D + 4, D12);
_mm256_storeu_pd(D + 8, D21);
_mm256_storeu_pd(D + 12, D22);
_mm256_storeu_pd(D + 16, D31);
_mm256_storeu_pd(D + 20, D32);
_mm256_storeu_pd(D + 24, D41);
_mm256_storeu_pd(D + 28, D42);
_mm256_storeu_pd(D + 32, D51);
_mm256_storeu_pd(D + 36, D52);
_mm256_storeu_pd(D + 40, D61);
_mm256_storeu_pd(D + 44, D62);
}
A couple of comments:
- I’d read that larger vectors are not automatically split into multiple instances of smaller vectors in an efficient way, unlike the LLVM intrinsics in Julia. Therefore, we need to split each column into two chunks. I have not yet tested whether this is still true.
- We need to remember substantially more. Defining the vector type we want in Julia is straightforward, e.g. Vec{4,Float64}
specifies a vector of 4 doubles, and then we can use this type definition to dispatch to the correct functions.
- I used long
for the integer values, because Int
in Julia defaults to 64 bit integers.
For Fortran I didn’t use vector intrinsics.
The do
loop (ie, for
loop in other languages) failed to optimize correctly with the compilers and flags I tried (gfortran, flang, ifort). It either stored and loaded from registers unnecessarily, or tried to vectorize across loop iterations instead of within them.
I filed an issue on gcc’s bugzilla here (despite the bug’s name, I’d prefer not using -funroll-loops):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625
Manually unrolling the loop probably isn’t really worth it. I’ll spare you the full code (you can download it from the above link), but it looks like this:
subroutine mulkernel(D, A, X)
real(8), dimension(8,16), intent(in) :: A
real(8), dimension(16,6), intent(in) :: X
real(8), dimension(8,6), intent(inout) :: D ! D is read and updated, so intent(inout)
D(1:8,1) = D(1:8,1) + A(1:8,1) * X(1,1)
D(1:8,2) = D(1:8,2) + A(1:8,1) * X(1,2)
D(1:8,3) = D(1:8,3) + A(1:8,1) * X(1,3)
D(1:8,4) = D(1:8,4) + A(1:8,1) * X(1,4)
D(1:8,5) = D(1:8,5) + A(1:8,1) * X(1,5)
D(1:8,6) = D(1:8,6) + A(1:8,1) * X(1,6)
D(1:8,1) = D(1:8,1) + A(1:8,2) * X(2,1)
D(1:8,2) = D(1:8,2) + A(1:8,2) * X(2,2)
D(1:8,3) = D(1:8,3) + A(1:8,2) * X(2,3)
D(1:8,4) = D(1:8,4) + A(1:8,2) * X(2,4)
D(1:8,5) = D(1:8,5) + A(1:8,2) * X(2,5)
D(1:8,6) = D(1:8,6) + A(1:8,2) * X(2,6)
simply repeating these blocks 16 times, for n = 1:16
. That means, unlike the Julia and C code, we awkwardly need N=16. Unrolling bloats the code size here without real benefit, because each loop iteration already contains 12 fma instructions.
I compiled with (gcc version 8.1):
gcc -O2 -march=haswell -shared -fPIC loopkernel.c -o libcloopkernel.so
gfortran -Ofast -march=haswell -shared -fPIC fortunrolled.f90 -o libfunrolledkernel.so
The architecture is znver1, but choosing Haswell often results in much faster code — in particular, it does here. -mprefer-vector-width=256
gets you some of the way, but not all — it still ended up using a bunch of silly memory operations that made the Fortran code run 2x slower. So haswell it is.
fpath = "/home/chris/Documents/progwork/julia/matmul"
const mullib = joinpath(fpath, "libfunrolledkernel.so")
const cmullib = joinpath(fpath, "libcloopkernel.so")
function cmul!(D::MMatrix{8,6,Float64},A::MMatrix{8,N,Float64},X::MMatrix{N,6,Float64}) where N
ccall((:kernel8x16x6, cmullib), Cvoid, (Ptr{Float64},Ptr{Float64},Ptr{Float64}, Clong), D, A, X, N)
D
end
function fmul!(D::MMatrix{8,6,Float64},A::MMatrix{8,16,Float64},X::MMatrix{16,6,Float64})
ccall((:__kernels_MOD_mulkernel, mullib), Cvoid, (Ptr{Float64},Ptr{Float64},Ptr{Float64}), D, A, X)
D
end
D8_6 .= 0; cmul!(D8_6, A8_16, X16_6)
@benchmark cmul!(D8_6, A8_16, X16_6)
D8_6 .= 0; fmul!(D8_6, A8_16, X16_6)
@benchmark fmul!(D8_6, A8_16, X16_6)
Julia’s performance here was fairly close. I’ll use that as justification to focus on Julia for now.
Let us now consider the 16xNx2 kernel:
function kernel!(D::MMatrix{16,2,Float64},
A::MMatrix{16,N,Float64},
X::MMatrix{N,2,Float64}) where N
pD, pA = pointer(D), pointer(A)
V = Vec{16,Float64}
@inbounds begin
Dx1 = vload(V, pD + 128*0)
Dx2 = vload(V, pD + 128*1)
for n ∈ 1:N # Here, we increase N.
vA = vload(V, pA + (n-1)*128) # 8 bytes/element * 16 elements/column
Dx1 = fma(vA, V(X[n,1]), Dx1)
Dx2 = fma(vA, V(X[n,2]), Dx2)
end
end
vstore(Dx1, pD + 128*0)
vstore(Dx2, pD + 128*1)
D
end
const D16_2 = @MMatrix zeros(16,2)
const A16_24 = @MMatrix randn(16,24)
const X24_2 = @MMatrix randn(24,2)
kernel!(D16_2, A16_24, X24_2)
mul!(D16_2, A16_24, X24_2)
@benchmark kernel!(D16_2, A16_24, X24_2)
@benchmark mul!(D16_2, A16_24, X24_2)
As expected, the $8\times6$ kernel seems to perform better.
Here, I tried to keep the total amount of work constant; that is, $8*16*6=16*24*2=768$. This is the number of multiplication operations needed: each of the $m\times p$ entries of the output matrix is a dot product of a length-$N$ row of $\textbf{A}$ with a length-$N$ column of $\textbf{X}$.
We can look at the generated assembly:
@code_native kernel!(D8_6, A8_16, X16_6)
At L144
, we see vmovapd %ymm11, %ymm12
which is probably unnecessary (moving contents from register 11 to register 12), and gets executed once per iteration of the for loop.
It seems to happen because, after the last broadcast, there is a vfmadd instruction that writes to register 11 instead of register 12, where it is actually trying to accumulate the values loaded from D:
vfmadd213pd %ymm12, %ymm14, %ymm11
Interestingly, I did experiment with lightly editing the above assembly and compiling it with gcc, and it was about as fast as the C code.
Of course, we’re less interested in the kernels themselves than we are in multiplying larger dense matrices. Let us compare to OpenBLAS:
@show BLAS.vendor()
BLAS.set_num_threads(1)
const D = Matrix{Float64}(undef,800,600)
const A, X = randn(800,1600), randn(1600,600)
@benchmark mul!(D, A, X)
Scaled for size, our kernel is about the same speed. Assuming O(n^3)
scaling, the above should take $(10^2)^3=10^6$ times longer than the kernel, or about 53.5 ms (we swap ns
for ms
), as we see.
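As a rough back-of-the-envelope figure (my own arithmetic, using the roughly 53.5 ns kernel time implied above, and counting each fma as two floating point operations):
2 * 8 * 16 * 6 / 53.5e-9 / 1e9 # ≈ 28.7 GFLOPS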
For what it’s worth, calling the kernel a million times here mostly seems to improve consistency:
@benchmark foreach(i -> kernel!(D8_6, A8_16, X16_6), 1:10^6)
but when we get to it, we are going to see a bigger improvement from increasing N
to decrease the relative frequency of reads and writes from D
. Getting it so that we aren’t bottlenecked by memory will be the bigger problem.
Now, the advantage of writing in Julia is that it lets us write generic optimized code. Let’s break this down into two problems: functions to choose the kernel size, and making the pattern agnostic to the size of the inputs.
We’ll give a simple example of the latter with MMatrices here, with one that accounts for strides across columns when we put things together later.
Here, I am $-$ like in C $-$ defining the SIMD vectors to be exactly the register width, to save the compiler the work of splitting them. When I didn’t do this initially, some of the kernels took over a second to compile, and a few even failed to vectorize correctly.
using Base.Cartesian
@generated function generic_kernel!(D::MMatrix{M,P,T},
A::MMatrix{M,N,T},
X::MMatrix{N,P,T}) where {M,P,N,T}
MSoT = M*sizeof(T)
L = REGISTER_SIZE ÷ sizeof(T)
Q, r = divrem(M, L) #Assuming M is a multiple of L
r == 0 || throw("Number of rows $M not a multiple of register size: $REGISTER_SIZE.")
V = Vec{L,T}
quote
pD, pA = pointer(D), pointer(A)
@nexprs $P p -> @nexprs $Q q -> Dx_p_q = vload($V, pD + $REGISTER_SIZE*(q-1) + $MSoT*(p-1))
@inbounds for n ∈ 1:N
@nexprs $P p -> begin
vX = $V(X[n,p])
@nexprs $Q q -> begin
vA_q = vload($V, pA + (n-1)*$MSoT + $REGISTER_SIZE*(q-1))
Dx_p_q = fma(vA_q, vX, Dx_p_q)
end
end
end
@nexprs $P p -> @nexprs $Q q -> vstore(Dx_p_q, pD + $REGISTER_SIZE*(q-1) + $MSoT*(p-1))
D
end
end
D8_6 .= 0; generic_kernel!(D8_6, A8_16, X16_6)
@benchmark generic_kernel!(D8_6, A8_16, X16_6)
If you’re unfamiliar with generated functions, try removing the @generated
macro and calling the function; you’ll see a quoted Julia expression.
In a generated function, you can use type information of the arguments in building a Julia expression, which the function then returns. The first time a generated function gets called for a combination of input types, it gets run, and then this quoted expression is what gets compiled.
You can also use @macroexpand
to explore the behavior of Base.Cartesian
‘s @nexprs
. For example, try
@macroexpand @nexprs 3 i -> @nexprs 2 j -> d_i_j = (i,j)
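which expands (up to the surrounding begin/end blocks) to something like:
d_1_1 = (1, 1)
d_1_2 = (1, 2)
d_2_1 = (2, 1)
d_2_2 = (2, 2)
d_3_1 = (3, 1)
d_3_2 = (3, 2)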
This is a convenient way of writing unrolled expressions.
# const REGISTER_SIZE = CpuId.simdbytes() # defined earlier
const REGISTER_COUNT = CpuId.cpufeature(CpuId.AVX512F) ? 32 : 16
const CACHELINE_SIZE = CpuId.cachelinesize()
function kernel_size_summary(::Type{T} = Float64,
register_size = REGISTER_SIZE, register_count = REGISTER_COUNT) where T
t_size = sizeof(T)
num_per_register = register_size ÷ t_size
max_total = num_per_register * register_count
cache_line = CACHELINE_SIZE ÷ t_size
num_cache_lines = cld(max_total, cache_line)
summary = Matrix{Float64}(undef, 5, num_cache_lines)
for num_row_cachelines ∈ 1:num_cache_lines
num_rows = num_row_cachelines * cache_line
a_loads = cld(num_rows, num_per_register)
num_cols = (register_count - a_loads - 2) ÷ a_loads
if num_cols == 0
println("A * X = D; nrow(D), ncol(D), length(D), NumRegeristersLoaded")
return summary[:, 1:num_row_cachelines-1]
end
summary[:, num_row_cachelines] .= (num_rows, num_cols, num_rows * num_cols, num_cols + a_loads, num_rows * num_cols / (num_cols + a_loads))
end
println("A * X = D; nrow(D), ncol(D), length(D), NumRegeristersLoaded")
summary
end
kernel_size_summary(Float64)
kernel_size_summary(Float32)
kernel_size_summary(Complex{Float64})
This suggests 6 columns is generally the way to go.
In practice, we can have the function simply return the combination that produces the highest value in the bottom row.
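For example, a hedged sketch of such a helper (the name and exact form are my own), picking the column of the summary with the largest bottom-row ratio:
function pick_kernel_size(::Type{T} = Float64) where T
    summary = kernel_size_summary(T)
    best = argmax(summary[5, :])                  # column with the most length(D) per register loaded
    Int(summary[1, best]), Int(summary[2, best])  # (m, p)
end
pick_kernel_size(Float64) # (8, 6) with the AVX2 numbers above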
We’ll focus on performance of structs of multiple numbers like Complex{Float64}
and DualNumbers later.
I haven’t looked at these yet, but I imagine StructsOfArrays.jl
may be the way to go for performance. Switching to a computer with AVX512:
versioninfo()
using CpuId
simdbytes(), cpufeature(CpuId.AVX512F), cpufeature(CpuId.FP256)
const REGISTER_SIZE = CpuId.simdbytes()
const REGISTER_COUNT = CpuId.cpufeature(CpuId.AVX512F) ? 32 : 16
const CACHELINE_SIZE = CpuId.cachelinesize()
function kernel_size_summary(::Type{T} = Float64,
register_size = REGISTER_SIZE, register_count = REGISTER_COUNT) where T
t_size = sizeof(T)
num_per_register = register_size ÷ t_size
max_total = num_per_register * register_count
cache_line = CACHELINE_SIZE ÷ t_size
num_cache_lines = cld(max_total, cache_line)
summary = Matrix{Float64}(undef, 5, num_cache_lines)
for num_row_cachelines ∈ 1:num_cache_lines
num_rows = num_row_cachelines * cache_line
a_loads = cld(num_rows, num_per_register)
num_cols = (register_count - a_loads - 2) ÷ a_loads
if num_cols == 0
println("A * X = D; nrow(D), ncol(D), length(D), NumRegeristersLoaded")
return summary[:, 1:num_row_cachelines-1]
end
summary[:, num_row_cachelines] .= (num_rows, num_cols, num_rows * num_cols, num_cols + a_loads, num_rows * num_cols / (num_cols + a_loads))
end
println("A * X = D; nrow(D), ncol(D), length(D), NumRegeristersLoaded")
summary
end
kernel_size_summary(Float64)
kernel_size_summary(Float32)
kernel_size_summary(Complex{Float64})
using Base.Cartesian, SIMD, StaticArrays, BenchmarkTools
@generated function generic_kernel!(D::MMatrix{M,P,T},
A::MMatrix{M,N,T},
X::MMatrix{N,P,T}) where {M,P,N,T}
MSoT = M*sizeof(T)
L = REGISTER_SIZE ÷ sizeof(T)
Q, r = divrem(M, L) #Assuming M is a multiple of L
r == 0 || throw("Number of rows $M not a multiple of register size: $REGISTER_SIZE.")
V = Vec{L,T}
quote
pD, pA = pointer(D), pointer(A)
@nexprs $P p -> @nexprs $Q q -> Dx_p_q = vload($V, pD + $REGISTER_SIZE*(q-1) + $MSoT*(p-1))
@inbounds for n ∈ 1:N
@nexprs $P p -> begin
vX = $V(X[n,p])
@nexprs $Q q -> begin
vA_q = vload($V, pA + (n-1)*$MSoT + $REGISTER_SIZE*(q-1))
Dx_p_q = fma(vA_q, vX, Dx_p_q)
end
end
end
@nexprs $P p -> @nexprs $Q q -> vstore(Dx_p_q, pD + $REGISTER_SIZE*(q-1) + $MSoT*(p-1))
D
end
end
const D16_14 = @MMatrix zeros(16,14)
const A16_25 = @MMatrix randn(16,25)
const X25_14 = @MMatrix randn(25,14)
const D40_5 = @MMatrix zeros(40,5)
const A40_28 = @MMatrix randn(40,28)
const X28_5 = @MMatrix randn(28,5)
@benchmark generic_kernel!(D16_14, A16_25, X25_14)
@benchmark generic_kernel!(D40_5, A40_28, X28_5)
const sD32_14 = @MMatrix zeros(Float32, 32,14)
const sA32_25 = @MMatrix randn(Float32, 32,25)
const sX25_14 = @MMatrix randn(Float32, 25,14)
const sD80_5 = @MMatrix zeros(Float32, 80,5)
const sA80_28 = @MMatrix randn(Float32, 80,28)
const sX28_5 = @MMatrix randn(Float32, 28,5)
@benchmark generic_kernel!(sD32_14, sA32_25, sX25_14)
@benchmark generic_kernel!(sD80_5, sA80_28, sX28_5)
This difference looks very small. Further testing is probably necessary to see if the $P=5$ kernel is actually optimal.
Also worth noting is the awesome performance of AVX-512 here. Scaling up the minimum time for the generic kernel run on Ryzen:
53.064 * (16 / 8) * (25 / 16) * (14 / 6)
That is an estimated runtime of roughly 390 ns, vs the little more than 100 ns we observed.
const Dbig = randn(1600, 1400)
const Abig = randn(1600, 2500)
const Xbig = randn(2500, 1400)
using LinearAlgebra
@show BLAS.vendor()
BLAS.set_num_threads(1)
@benchmark mul!(Dbig, Abig, Xbig)
Here, the kernel trounces OpenBLAS $-$ but thanks to the massive throughput, fighting memory bandwidth is going to be an uphill battle.