SIMDArrays

Right now my work has been focusing on small dimensional problems, sometimes involving billions of operations with tiny matrices. BLAS libraries like MKL and especially OpenBLAS perform poorly with small matrices, which is one of the primary motivations for “rolling my own”. StaticArrays.jl is a fairly solid library of unrolled expressions, more like my earlier blog posts, much faster for small sizes and falling back to BLAS operations above a modest size.
However, we can at least much StaticArrays at small sizes, and do dramatically better beyond that $-$ beating BLAS libraries up to a much larger size, and hopefully matching beyond that (if not, we can also simply fall back).

Another motivation is being able to perform operations like $\textbf{X} = \textbf{A}(\textbf{Y} – \textbf{B})$, which we can calculate more efficiently by condensing it into a single operation.

So far I have not yet implemented prefetching (and no multithreading), but performance is good at relatively small sizes.
I’ve also set the strides to be a multiple of register size, to increase the performance at small sizes.

I will compare to MMatries from StaticArrays.jl. It defaults to different methods at different sizes:
https://github.com/JuliaArrays/StaticArrays.jl/blob/master/src/matrix_multiply.jl#L93

@generated function _mul(Sa::Size{sa}, Sb::Size{sb}, a::Union{SizedMatrix{T}, MMatrix{T}, MArray{T}}, b::Union{SizedMatrix{T}, MMatrix{T}, MArray{T}}) where {sa, sb, T <: BlasFloat}
S = Size(sa, sb)

# Heuristic choice between BLAS and explicit unrolling (or chunk-based unrolling)
if sa*sa*sb >= 14*14*14
return quote
@_inline_meta
C = similar(a, T, S) # The Ses were actually interpolated into the quote
mul_blas!(S, C, Sa, Sb, a, b) # but this stopped the quote from printing correctly.
return C
end
elseif sa*sa*sb < 8*8*8
return quote
@_inline_meta
return mul_unrolled(Sa, Sb, a, b)
end
elseif sa <= 14 && sa <= 14 && sb <= 14
return quote
@_inline_meta
return similar_type(a, T, $S)(mul_unrolled_chunks(Sa, Sb, a, b)) end else return quote @_inline_meta return mul_loop(Sa, Sb, a, b) end end end  So we can compare to each of these. First, some code to convert SIMDArrays to MMatrices: In : using SIMDArrays, StaticArrays function copyloop!(x, y) # Broadcasts slow to compile at the moment; will fix when I implement broadcasting! @boundscheck size(x) == size(y) || throw(BoundsError()) @inbounds for i ∈ eachindex(x,y) x[i] = y[i] end end function Base.convert(::Type{MMatrix}, A::SIMDArrays.SizedSIMDMatrix{M,N,T}) where {M,N,T} out = MMatrix{M,N,T}(undef) copyloop!(out, A) out end using BenchmarkTools, LinearAlgebra M,N,P = 3,3,3 const D_3_3, A_3_3, X_3_3 = SIMDArrays.randsimd(M,P), SIMDArrays.randsimd(M,N), SIMDArrays.randsimd(N,P) const mD_3_3, mA_3_3, mX_3_3 = convert(MMatrix, D_3_3), convert(MMatrix, A_3_3), convert(MMatrix, X_3_3) @btime mul!(mD_3_3, mA_3_3, mX_3_3) #multiplying StaticArrays MMatrices   8.697 ns (0 allocations: 0 bytes)  Out: 3×3 MArray{Tuple{3,3},Float64,2,9}: 0.597014 1.12347 0.783426 0.994657 0.589675 0.496053 0.591024 0.74759 0.352635 In : @btime mA_3_3 * mX_3_3 #multiplying StaticArrays MMatrices creating stack allocated StaticMatrix   9.778 ns (0 allocations: 0 bytes)  Out: 3×3 SArray{Tuple{3,3},Float64,2,9}: 0.597014 1.12347 0.783426 0.994657 0.589675 0.496053 0.591024 0.74759 0.352635 In : @btime mul!(D_3_3, A_3_3, X_3_3); D_3_3 #multiplying SIMDArrays   3.975 ns (0 allocations: 0 bytes)  Out: 3×3 SIMDArrays.SizedSIMDArray{Tuple{3,3},Float64,2,4,12}: 0.597014 1.12347 0.783426 0.994657 0.589675 0.496053 0.591024 0.74759 0.352635 We can automate creating constants to benchmark with. Normally, we can also interpolate non-constants: In : d_3_3, a_3_3, x_3_3 = D_3_3, A_3_3, X_3_3 md_3_3, ma_3_3, mx_3_3 = mD_3_3, mA_3_3, mX_3_3 @btime mul!($md_3_3, $ma_3_3,$mx_3_3) #multiplying StaticArrays MMatrices

  7.583 ns (0 allocations: 0 bytes)

Out:
3×3 MArray{Tuple{3,3},Float64,2,9}:
0.597014  1.12347   0.783426
0.994657  0.589675  0.496053
0.591024  0.74759   0.352635
In :
@btime $ma_3_3 *$mx_3_3 #multiplying StaticArrays MMatrices creating stack allocated StaticMatrix

  8.274 ns (0 allocations: 0 bytes)

Out:
3×3 SArray{Tuple{3,3},Float64,2,9}:
0.597014  1.12347   0.783426
0.994657  0.589675  0.496053
0.591024  0.74759   0.352635
In :
@btime mul!($d_3_3,$a_3_3, $x_3_3) #multiplying SIMDArrays   9.934 ns (0 allocations: 0 bytes)  But I think the dispatch somehow isn’t getting fully resolved at compile time when I do that. I ran into a peculiar issue where Julia seemed slower than the assembly you get with @code_native: https://discourse.julialang.org/t/julia-one-third-slower-than-ccall-ing-code-native-assembly-compiled-with-gcc/12901 In that thread, I used @code_native to get the assembly, renamed a few variables, and added a header similar to what gcc spits out. Calling these Julia functions was somehow slower than ccall-ing the disassembled and reassembled functions. Later, I discovered that if instead of interpolating values into the benchmark, I declared them constant, the ccall ran at about the same speed, and my Julia function then suddenly matched it. Some preliminary tests support that the const benchmarks better reflect the actual runtime when calling mul! inside of functions. For example, that is exactly how I multiply larger matrices, and they scale with the faster run time before running into memory issues. A word of caution on benchmarking constants: if they are immutable isbits types (rather than references), the compiler will often constant fold them. Benchmarking the immutable StaticArrays this way for example results in roughly 1.5ns estimates. There may be a few places I ought to add an @inline, but I want to approach that in a disciplined way. For now I’ve been simply trusting Julia’s inlining heuristics. I wanted SIMDArrays to compile quickly and generate relatively small code (while being fast). To ease the pain of creating all these constants, I create a function to do it for me: In : function create_constants(M,N,P) dsym, asym, xsym = Symbol(:D_, M,:_,P), Symbol(:A_, M,:_,N), Symbol(:X_, N,:_,P) @eval const$dsym, $asym,$xsym = SIMDArrays.randsimd($M,$P), SIMDArrays.randsimd($M,$N), SIMDArrays.randsimd($N,$P)
@eval const $(Symbol(:m,dsym)),$(Symbol(:m,asym)), $(Symbol(:m,xsym)) = convert(MMatrix,$dsym), convert(MMatrix, $asym), convert(MMatrix,$xsym)
nothing
end
create_constants(4,4,4)

@btime mul!(mD_4_4, mA_4_4, mX_4_4) #multiplying StaticArrays MMatrices

  10.619 ns (0 allocations: 0 bytes)

Out:
4×4 MArray{Tuple{4,4},Float64,2,16}:
1.44059  1.24725   0.847892  0.4459
1.59267  1.58134   1.01729   0.333637
1.07543  0.707921  0.477089  0.391186
1.71865  1.59034   1.04504   0.498463
In :
@btime mul!(D_4_4, A_4_4, X_4_4); D_4_4 #multiplying SIMDArrays

  8.972 ns (0 allocations: 0 bytes)

Out:
4×4 SIMDArrays.SizedSIMDArray{Tuple{4,4},Float64,2,4,16}:
1.44059  1.24725   0.847892  0.4459
1.59267  1.58134   1.01729   0.333637
1.07543  0.707921  0.477089  0.391186
1.71865  1.59034   1.04504   0.498463
In :
create_constants(2,2,2)
@btime mul!(mD_2_2, mA_2_2, mX_2_2) #multiplying StaticArrays MMatrices

  3.583 ns (0 allocations: 0 bytes)

Out:
2×2 MArray{Tuple{2,2},Float64,2,4}:
0.217525  0.617742
0.494758  0.715883
In :
@btime mul!(D_2_2, A_2_2, X_2_2); D_2_2 #multiplying SIMDArrays

  2.908 ns (0 allocations: 0 bytes)

Out:
2×2 SIMDArrays.SizedSIMDArray{Tuple{2,2},Float64,2,2,4}:
0.217525  0.617742
0.494758  0.715883

With 2x2 matrices, we don’t see any difference in runtime. SIMDArrays is not pulling any special tricks there, and the compiler may be smart enough to mostly get it right for the MMatrix.
However, beyond that we start seeing an advantage for SIMDArrays $-$ as we saw for the 3×3 and 4×4.

This is pronounced for 5×5 matrices. 5 rows for $\textbf{D}$ and $\textbf{A}$ is an awkward number to vectorize:

In :
create_constants(5,5,5)
@btime mul!(mD_5_5, mA_5_5, mX_5_5) #multiplying StaticArrays MMatrices

  20.375 ns (0 allocations: 0 bytes)

Out:
5×5 MArray{Tuple{5,5},Float64,2,25}:
1.68442  0.506134  1.35581  1.35819  1.26924
2.02233  1.00744   1.94574  1.8989   1.64038
1.1012   0.81127   1.29916  1.20106  1.38465
1.56722  0.899125  1.69413  1.65455  1.73012
1.61973  0.38083   1.31597  1.1869   1.34235
In :
@btime mul!(D_5_5, A_5_5, X_5_5); D_5_5 #multiplying SIMDArrays

  6.454 ns (0 allocations: 0 bytes)

Out:
5×5 SIMDArrays.SizedSIMDArray{Tuple{5,5},Float64,2,8,40}:
1.68442  0.506134  1.35581  1.35819  1.26924
2.02233  1.00744   1.94574  1.8989   1.64038
1.1012   0.81127   1.29916  1.20106  1.38465
1.56722  0.899125  1.69413  1.65455  1.73012
1.61973  0.38083   1.31597  1.1869   1.34235

Which is why under the hood SIMDArrays adds zeros as bufer, to fill out the rest of the vector. Here is the data underlying the SIMDArray:

In :
D_5_5.data

Out:
(1.6844180822310462, 2.022333347986847, 1.1011990548690336, 1.567220104087964, 1.619727683656056, 0.0, 0.0, 0.0, 0.506133934128852, 1.0074422314972424, 0.811270408701846, 0.899125421174424, 0.3808302175141171, 0.0, 0.0, 0.0, 1.3558118073661667, 1.9457441204894732, 1.2991588123604578, 1.694129984532196, 1.3159696099091744, 0.0, 0.0, 0.0, 1.3581934696063334, 1.898901296667697, 1.2010567621831534, 1.654553907379835, 1.1868992962898552, 0.0, 0.0, 0.0, 1.2692361051364576, 1.6403767281753225, 1.3846511325720616, 1.730116783332563, 1.3423498222539922, 0.0, 0.0, 0.0)

Unfortunately, we have to initialize the buffer to 0.0; uninitizlied memory often contains subnormals, which will result in very slow calculations!
I did not try set_zero_subnormals(true), because I didn’t want to enforce a particular global state when someone loads SIMDArrays. However, if anyone is intersted, that $-$ assuming it also correctly restores performance $-$ would allow for slightly faster allocation, thanks to not not having to initalize the buffer.

This also means that if we round up the number of rows, performance of SIMDArrays is unaffected (while StaticArrays seems to benefit):

In :
create_constants(8,5,5)
@btime mul!(mD_8_5, mA_8_5, mX_5_5) #multiplying StaticArrays MMatrices

  17.303 ns
WARNING: redefining constant X_5_5
WARNING: redefining constant mX_5_5

 (0 allocations: 0 bytes)

Out:
8×5 MArray{Tuple{8,5},Float64,2,40}:
1.40365  0.655686  0.930134  0.923345  1.21126
1.65736  0.576863  0.580237  0.943719  1.33141
1.4589   0.734229  0.689919  1.25286   1.14245
2.19137  1.04887   0.974444  1.75166   1.37686
1.15278  0.424359  0.723352  0.454175  1.24452
1.91782  0.873217  1.13511   1.15201   1.56021
1.81249  1.00741   1.18249   1.55088   1.41598
1.8969   0.892188  1.0249    1.33914   1.1299 
In :
@btime mul!(D_8_5, A_8_5, X_5_5); D_8_5 #multiplying SIMDArrays

  6.576 ns (0 allocations: 0 bytes)

Out:
8×5 SIMDArrays.SizedSIMDArray{Tuple{8,5},Float64,2,8,40}:
1.40365  0.655686  0.930134  0.923345  1.21126
1.65736  0.576863  0.580237  0.943719  1.33141
1.4589   0.734229  0.689919  1.25286   1.14245
2.19137  1.04887   0.974444  1.75166   1.37686
1.15278  0.424359  0.723352  0.454175  1.24452
1.91782  0.873217  1.13511   1.15201   1.56021
1.81249  1.00741   1.18249   1.55088   1.41598
1.8969   0.892188  1.0249    1.33914   1.1299 
In :
D_5_5.data

Out:
(1.6844180822310462, 2.022333347986847, 1.1011990548690336, 1.567220104087964, 1.619727683656056, 0.0, 0.0, 0.0, 0.506133934128852, 1.0074422314972424, 0.811270408701846, 0.899125421174424, 0.3808302175141171, 0.0, 0.0, 0.0, 1.3558118073661667, 1.9457441204894732, 1.2991588123604578, 1.694129984532196, 1.3159696099091744, 0.0, 0.0, 0.0, 1.3581934696063334, 1.898901296667697, 1.2010567621831534, 1.654553907379835, 1.1868992962898552, 0.0, 0.0, 0.0, 1.2692361051364576, 1.6403767281753225, 1.3846511325720616, 1.730116783332563, 1.3423498222539922, 0.0, 0.0, 0.0)

Note the padding of three 0.0 entires at the end of each column of 8 elements.
SIMDArrays is however also much less sensitive to increases in the other axis:

In :
create_constants(5,8,8)
@btime mul!(mD_5_8, mA_5_8, mX_8_8) #multiplying StaticArrays MMatrices

  58.552 ns (0 allocations: 0 bytes)

Out:
5×8 MArray{Tuple{5,8},Float64,2,40}:
1.39768   2.95314  1.7589   3.03318  3.21554  2.5921   2.85997  2.41354
0.887731  1.23225  1.12416  1.78744  1.72041  1.0988   2.00841  1.39155
1.26485   2.27086  1.74075  2.80953  2.56337  1.89558  2.84113  1.91759
1.16735   1.29145  1.25639  1.99119  1.97982  1.37221  1.7136   1.61483
1.71824   2.47468  2.22646  3.10875  3.43757  2.27632  3.24572  2.18747
In :
@btime mul!(D_5_8, A_5_8, X_8_8); D_5_8 #multiplying SIMDArrays

  12.833 ns (0 allocations: 0 bytes)

Out:
5×8 SIMDArrays.SizedSIMDArray{Tuple{5,8},Float64,2,8,64}:
1.39768   2.95314  1.7589   3.03318  3.21554  2.5921   2.85997  2.41354
0.887731  1.23225  1.12416  1.78744  1.72041  1.0988   2.00841  1.39155
1.26485   2.27086  1.74075  2.80953  2.56337  1.89558  2.84113  1.91759
1.16735   1.29145  1.25639  1.99119  1.97982  1.37221  1.7136   1.61483
1.71824   2.47468  2.22646  3.10875  3.43757  2.27632  3.24572  2.18747

While this does cause jumps in runtime whenever we exceed a vector size. From here on, I wont print the resulting matrices $-$ as they’re getting bigger, they start demanding a little too much scrolling.

In :
create_constants(9,9,9)
@btime mul!(mD_9_9, mA_9_9, mX_9_9); #multiplying StaticArrays MMatrices

  82.598 ns (0 allocations: 0 bytes)

In :
@btime mul!(D_9_9, A_9_9, X_9_9); #multiplying SIMDArrays

  26.448 ns (0 allocations: 0 bytes)


Now, looking again at the paths the MMatrix takes, we see that so far we’ve always taken one of these two paths:

elseif sa*sa*sb < 8*8*8
return quote
@_inline_meta
return mul_unrolled(Sa, Sb, a, b)
end
elseif sa <= 14 && sa <= 14 && sb <= 14
return quote
@_inline_meta
return similar_type(a, T, $S)(mul_unrolled_chunks(Sa, Sb, a, b)) end else  Lets look at that last possibility mul_loop(Sa, Sb, a, b), before turning to BLAS for large matrices. We go down the final else if 8 <= M*N*P < 14^3, and at least one of them is greater than 14. Lets compare the boundary, chooseing M=P=8, and let N=14,15. In : create_constants(8,14,8) @btime mul!(mD_8_8, mA_8_14, mX_14_8); #multiplying StaticArrays MMatrices   68.700 ns (0 allocations: 0 bytes)  In : @btime mul!(D_8_8, A_8_14, X_14_8); #multiplying SIMDArrays   24.183 ns (0 allocations: 0 bytes)  In : create_constants(8,15,8) @btime mul!(mD_8_8, mA_8_15, mX_15_8); #multiplying StaticArrays MMatrices   75.213 ns WARNING: redefining constant D_8_8 WARNING: redefining constant mD_8_8   (0 allocations: 0 bytes)  In : @btime mul!(D_8_8, A_8_15, X_15_8); #multiplying SIMDArrays   23.437 ns (0 allocations: 0 bytes)  That wasn’t too exciting, so now time to compare vs BLAS. Here, I am benchmarking vs MKL. On this computer, it is much faster than OpenBLAS, which Julia ships with by default. Again, lets test at the boundary first In : BLAS.set_num_threads(1) # only comparing single threaded performance. BLAS.vendor()  Out: :mkl In : create_constants(14,13,14) @btime mul!(mD_14_14, mA_14_13, mX_13_14); #multiplying StaticArrays MMatrices   460.126 ns (0 allocations: 0 bytes)  In : @btime mul!(D_14_14, A_14_13, X_13_14); #multiplying SIMDArrays   59.680 ns (0 allocations: 0 bytes)  In : create_constants(14,14,14) @btime mul!(mD_14_14, mA_14_14, mX_14_14); #multiplying with MKL   215.117 ns WARNING: redefining constant D_14_14 WARNING: redefining constant mD_14_14   (0 allocations: 0 bytes)  In : @btime mul!(D_14_14, A_14_14, X_14_14); #multiplying SIMDArrays   73.515 ns (0 allocations: 0 bytes)  The heursistic of when to switch from a StaticArrays method to BLAS are probably tuned better for OpenBLAS. Because MKL is much faster, the crossing over point occurs at a much smaller size (smaller than$14*14*14$). Still, the overhead on MKL is much higher than ours. In : create_constants(24,24,24); @btime mul!(mD_24_24, mA_24_24, mX_24_24);#multiplying with MKL   392.312 ns (0 allocations: 0 bytes)  In : @btime mul!(D_24_24, A_24_24, X_24_24); #multiplying SIMDArrays   252.189 ns (0 allocations: 0 bytes)  In : create_constants(49,49,49); @btime mul!(mD_49_49, mA_49_49, mX_49_49);#multiplying with MKL   3.091 μs (0 allocations: 0 bytes)  In : @btime mul!(D_49_49, A_49_49, X_49_49); #multiplying SIMDArrays   2.347 μs (0 allocations: 0 bytes)  In : create_constants(64,64,64); @btime mul!(mD_64_64, mA_64_64, mX_64_64);#multiplying with MKL   4.652 μs (0 allocations: 0 bytes)  In : @btime mul!(D_64_64, A_64_64, X_64_64); #multiplying SIMDArrays   4.408 μs (0 allocations: 0 bytes)  In : create_constants(128,128,128); @btime mul!(mD_128_128, mA_128_128, mX_128_128);#multiplying with MKL   36.007 μs (0 allocations: 0 bytes)  In : @btime mul!(D_128_128, A_128_128, X_128_128); #multiplying SIMDArrays   35.504 μs (0 allocations: 0 bytes)  In : create_constants(200,200,200); @btime mul!(mD_200_200, mA_200_200, mX_200_200);#multiplying with MKL   158.748 μs (0 allocations: 0 bytes)  In : @btime mul!(D_200_200, A_200_200, X_200_200); #multiplying SIMDArrays   141.671 μs (0 allocations: 0 bytes)  From here, we start running out of space in the L2 cache, and performance without prefetching begins to drop. It is still much faster than OpenBLAS with 700×700 matrices though, and therefore reasonably likely to provide a performance benefit even at these sizes when not running multithreaded on some platforms. To view the behavior of SIMDArrays for matrices of various sizes, a useful function is blocking_structure from jBLAS: In : using jBLAS: blocking_structure blocking_structure(204, 200, 200, Float64)  Out: (((16, 129, 14), (204, 200, 200), (204, 200, 200)), 1) The first set of values gives the dimensions ($M\times N\times P$) of the kernel. The second set gives dimensions of the matrix blocks held in the L2 cache. To multiply the matrices, it iterates over the blocks in the L2 cache in the order$N$,$P$,$M$. Note that iteration over$N$here is much shorter here (only cld(200,129) = 2 in the above example), because of how large this dimension inflates in the kernels. Within the$M$and$P$blocks I replace only one of the chunks from$\textbf{A}$or from$\textbf{X}\$ at a time to minimize how much data we have to move into the L1 cache. This means the loop is a little staggered. That is, instead of (1,1),(2,1),(3,1),(1,2),(2,2)... as one may typically have, the looping pattern is (1,1),(2,1),(3,1),(3,2),(1,2)....

jBLAS.jl uses CpuId.jl to provide information on cache and SIMD sizes, so that these algorithms can use that information for determining the blocking pattern. Running blocking_structure on different computers will often yield slightly different answers. To see information on cache sizes, you can use CpuId directly, or simply take the value from jBLAS. To see how many Float64 values fit in each cache level, one can simply:

In :
import jBLAS
jBLAS.CACHE_SIZE .÷ sizeof(Float64)

Out:
(4096, 131072, 1802240)

Note that the third cache level is shared by processors, while the first two are local to each core. That is something one would have to consider when trying to multithread multiplication of large matrices.
These tests were run on Julia 1.0:

In :
versioninfo()

Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)