Here I’ll compare the small-matrix multiplication performance of a few popular libraries against PaddedMatrices.jl:
I will benchmark the operation $\textbf{C} = \textbf{A} \times \textbf{B}$, where $\textbf{C}\in\mathbb{R}^{M\times N}$, $\textbf{A}\in\mathbb{R}^{M\times K}$, and $\textbf{B}\in\mathbb{R}^{K\times N}$.
I’ll consider every combination of $M\in(3,\ldots,32)$ and $N\in(3,\ldots,32)$ with $K=32$, and use a column-major data layout. When the matrix sizes are small enough to avoid memory bandwidth problems, as is the case here, $K$ should affect runtime in a perfectly linear fashion, because it should not affect the shape of the kernel, vectorization, or possible register spills; the outer loop should be over $K$.
See here for an introduction on matmul kernels.
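To make the role of $K$ concrete, here is a deliberately naive sketch of a kernel with the loop over $K$ on the outside. This is not PaddedMatrices’ actual kernel; for matrices this small, a real kernel would keep the $M\times N$ block of C in registers, so only the loop over $K$ has a runtime trip count, which is why runtime should scale linearly in $K$:
# Naive illustration only: accumulate rank-1 updates, looping over K outermost.
# For fixed, small M and N, the shape of the inner block does not depend on K,
# so doubling K doubles the work and changes nothing else.
function matmul_k_outer!(C, A, B)
    M, N = size(C)
    K = size(A, 2)
    fill!(C, zero(eltype(C)))
    @inbounds for k in 1:K      # the only loop whose trip count varies with K
        for n in 1:N, m in 1:M  # trip counts fixed by the kernel shape
            C[m, n] += A[m, k] * B[k, n]
        end
    end
    return C
end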
We’re testing 900 functions total.
So that all matrix sizes are known at compile time, and these templated C++ libraries can take advantage of this information to optimize the operations, I generate the C++ files programmatically:
Mrange = 3:32; Krange = 32; Nrange = 3:32;
function create_eigen_file_string(mrange, krange, nrange)
base_file = """
#include <Eigen/Dense>
using namespace Eigen;
extern "C" {
"""
for m ∈ mrange, k ∈ krange, n ∈ nrange
base_file *= """
\nvoid mul_$(m)x$(k)times$(k)x$(n)(
Matrix<double,$(m),$(n)> &C,
Matrix<double,$(m),$(k)> &A,
Matrix<double,$(k),$(n)> &B){
C = A * B;
}\n
"""
end
base_file * "\n}"
end
function create_blaze_file_string(mrange, krange, nrange)
base_file = """
#include <blaze/Math.h>
#include <blaze/math/StaticMatrix.h>
using blaze::StaticMatrix;
extern "C" {
"""
for m ∈ mrange, k ∈ krange, n ∈ nrange
base_file *= """
\nvoid mul_$(m)x$(k)times$(k)x$(n)(
StaticMatrix<double,$(m)UL,$(n)UL,blaze::columnMajor> &C,
StaticMatrix<double,$(m)UL,$(k)UL,blaze::columnMajor> &A,
StaticMatrix<double,$(k)UL,$(n)UL,blaze::columnMajor> &B){
C = A * B;
}\n
"""
end
base_file * "\n}"
end
Same deal with the Fortran code, although I don’t know any of the implementation details of gfortran’s matmul intrinsic. At the very least, the compiler can use the size information to unroll loops:
function create_fortran_file_string(mrange, krange, nrange)
base_file = """
module fixed_size_matmul
use ISO_C_BINDING
implicit none
contains
"""
for m ∈ mrange, k ∈ krange, n ∈ nrange
base_file *= """
\nsubroutine mul_$(m)x$(k)times$(k)x$(n)(C, A, B) bind(C, name = "mul_$(m)x$(k)times$(k)x$(n)")
real(C_double), dimension($m, $n), intent(out) :: C
real(C_double), dimension($m, $k), intent(in) :: A
real(C_double), dimension($k, $n), intent(in) :: B
C = matmul(A, B)
end subroutine mul_$(m)x$(k)times$(k)x$(n)\n
"""
end
base_file * "\n\nend module fixed_size_matmul"
end
I will also benchmark the performance impact of PaddedMatrices’ namesake on a CPU supporting AVX-512.
PaddedMatrices is named after the fact that by default it will automatically apply padding between matrix columns, so that the stride between columns is a multiple of SIMD vector width. This should make it easier for the CPU to load columns into registers, and use a single instruction to operate on multiple elements at a time.
Eigen does not support this, but because PaddedMatrices does, we will simply report the padded number of rows when testing padded multiplication. That is, if there are 7 real rows padded out to 8, we will report 8 rows. The same goes for gfortran and MKL JIT.
Blaze, on the other hand, also pads by default, so I will compile Blaze both with padding enabled and with it disabled: when our matrices are unpadded we call the unpadded Blaze build, and when they are padded we call the padded build.
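As a rough sketch of the padding rule, assuming 8 `Float64`s per 512-bit vector (the helper below is hypothetical, not part of PaddedMatrices’ API), the column stride is the row count rounded up to the next multiple of the vector width:
# Hypothetical illustration: round the number of rows up to a multiple of W = 8.
padded_stride(M, W = 8) = (M + W - 1) & -W
padded_stride(7)   # 8:  a 7-row matrix is stored with one padding row per column
padded_stride(16)  # 16: already a multiple of the vector width, no padding needed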
To compile the C++ code, I use the latest version of the g++ compiler:
run(`g++ --version`)
We turn on aggressive compiler optimizations (`-O3`) and enable cpu-specific optimizations (`-march=native`). `-Ofast` additionally turns on fast-math, but it causes Eigen to crash, so for Eigen we pass `-O3` plus a subset of the fast-math flags explicitly, and simply use `-Ofast` otherwise.
Looking at the assembly of a simple test function, the difference between `-O2 -ftree-vectorize` and `-Ofast` looks fairly negligible. I’m defaulting to the higher optimization levels.
In the case of gfortran, `-Ofast` looks like it generated excessive amounts of code relative to `-O2 -ftree-vectorize`. This is because the compiler pursued an inefficient vectorization pattern; `-fdisable-tree-cunrolli` prevents that, and produces more or less the same behavior as `-O2 -ftree-vectorize`. I test both versions here for completeness’ sake. However, I’d be careful with that flag: many loops need the tree-cunrolli pass to vectorize, so I wouldn’t want to compile an entire program with it disabled. When I ran into that problem before, I isolated the functions that needed the flag into a single file, and set the makefile to replace `-flto` with `-fdisable-tree-cunrolli` for that particular file.
Digression aside, here we create the source files and compile the five shared libraries, running the first four build jobs in parallel:
open("eigen_mul.cpp", "w") do io
eigen_file_string = create_eigen_file_string(Mrange, Krange, Nrange);
write(io, eigen_file_string)
end;
open("blaze_mul.cpp", "w") do io
blaze_file_string = create_blaze_file_string(Mrange, Krange, Nrange);
write(io, blaze_file_string)
end;
open("fortran_intrinsic_mul.f90", "w") do io
gfort_file_string = create_fortran_file_string(Mrange, Krange, Nrange);
write(io, gfort_file_string)
end;
open("blaze_mul.cpp", "w") do io
blaze_file_string = create_blaze_file_string(Mrange, Krange, Nrange);
write(io, blaze_file_string)
end;
run(
`g++ -O3 -fno-signed-zeros -fno-trapping-math -fassociative-math -march=native -mprefer-vector-width=512 -shared -fPIC -I/usr/include/eigen3 eigen_mul.cpp -o libeigenmul.so` &
`g++ -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC blaze_mul.cpp -o libblazemul.so` &
`g++ -DBLAZE_USE_PADDING=0 -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC blaze_mul.cpp -o libblazemul_unpadded.so` &
`gfortran -Ofast -march=native -fdisable-tree-cunrolli -mprefer-vector-width=512 -shared -fPIC fortran_intrinsic_mul.f90 -o libgfortmul.so`
)
run(`gfortran -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC fortran_intrinsic_mul.f90 -o libgforttcmul.so`);
We don’t compile the two gfortran instances at the same time, because trying to compile two different instances of the same module simultaneously causes problems. But using a module saves me the hassle of defining explicit interfaces for the `BIND(C)` procedures, and `BIND(C)` itself saves me the hassle of dealing with name-mangling.
Additionally, gfortran compiles quickly, but the two templated C++ libraries take more than six minutes to compile the 900 functions on this computer.
To test Intel MKL’s JIT, we use a simple Fortran file:
module jitmul
include "mkl_direct_call.fi"
use ISO_C_BINDING
implicit none
contains
subroutine dgemmwrap(C,A,B,M,K,N,alpha,beta) bind(C, name = "dgemmwrapped")
integer, intent(in) :: M, K, N
! real(C_double), parameter :: alpha = 1.0, beta = 0.0
real(C_double), intent(in) :: alpha, beta
real(C_double), dimension(M,K), intent(in) :: A
real(C_double), dimension(K,N), intent(in) :: B
real(C_double), dimension(M,N), intent(out) :: C
call dgemm('N', 'N', M, N, K, alpha, A, M, B, K, beta, C, M)
end subroutine dgemmwrap
end module jitmul
This has the advantage of being very easy to use; all we must do is define the macro `MKL_DIRECT_CALL_SEQ_JIT` while compiling, and we get to take advantage of JIT compilation.
run(`ifort --version`)
Writing it to a file, and compiling with a recent ifort that supports JIT for larger sizes:
open("fortran_mul.f90", "w") do io
write(io, """
module jitmul
#include "mkl_direct_call.fi"
use ISO_C_BINDING
implicit none
contains
subroutine dgemmwrap(C,A,B,M,K,N,alpha,beta) bind(C, name = "dgemmwrapped")
integer, intent(in) :: M, K, N
! real(C_double), parameter :: alpha = 1.0, beta = 0.0
real(C_double), intent(in) :: alpha, beta
real(C_double), dimension(M,K), intent(in) :: A
real(C_double), dimension(K,N), intent(in) :: B
real(C_double), dimension(M,N), intent(out) :: C
call dgemm('N', 'N', M, N, K, alpha, A, M, B, K, beta, C, M)
end subroutine dgemmwrap
end module jitmul
""")
end;
run(`ifort -fast -mkl -fpp -DMKL_DIRECT_CALL_SEQ_JIT -shared -fPIC fortran_mul.f90 -o libifortmul.so`);
I’m using the most recent commit of PaddedMatrices (as of June 2nd), and the following Julia version:
using PaddedMatrices, LinearAlgebra, OffsetArrays, Random, BenchmarkTools
versioninfo()
Wrapping these libraries with `ccall`:
const eigenlib = joinpath(pwd(), "libeigenmul.so")
const blazelib = joinpath(pwd(), "libblazemul.so")
const blazelib_unpadded = joinpath(pwd(), "libblazemul_unpadded.so")
const gfortlib = joinpath(pwd(), "libgfortmul.so")
const gforttclib = joinpath(pwd(), "libgforttcmul.so")
const ifortlib = joinpath(pwd(), "libifortmul.so")
using PaddedMatrices: AbstractMutableFixedSizePaddedMatrix
@generated function eigen_mul!(
C::AbstractMutableFixedSizePaddedMatrix{M,N,Float64,P},
A::AbstractMutableFixedSizePaddedMatrix{M,K,Float64,P},
B::AbstractMutableFixedSizePaddedMatrix{K,N,Float64,K}
) where {M,K,N,P}
func = QuoteNode(Symbol(:mul_, P, :x, K, :times, K, :x, N))
quote
ccall(
($func, eigenlib), Cvoid,
(Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}),
C, A, B
)
end
end
@generated function blaze_mul!(
C::AbstractMutableFixedSizePaddedMatrix{M,N,Float64,M},
A::AbstractMutableFixedSizePaddedMatrix{M,K,Float64,M},
B::AbstractMutableFixedSizePaddedMatrix{K,N,Float64,K}
) where {M,K,N}
func = QuoteNode(Symbol(:mul_, M, :x, K, :times, K, :x, N))
quote
ccall(
($func, blazelib_unpadded), Cvoid,
(Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}),
C, A, B
)
end
end
@generated function blaze_mul!(
C::AbstractMutableFixedSizePaddedMatrix{M,N,Float64,P},
A::AbstractMutableFixedSizePaddedMatrix{M,K,Float64,P},
B::AbstractMutableFixedSizePaddedMatrix{K,N,Float64,K}
) where {M,K,N,P}
func = QuoteNode(Symbol(:mul_, M, :x, K, :times, K, :x, N))
quote
ccall(
($func, blazelib), Cvoid,
(Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}),
C, A, B
)
end
end
@generated function gfort_mul!(
C::AbstractMutableFixedSizePaddedMatrix{M,N,Float64,P},
A::AbstractMutableFixedSizePaddedMatrix{M,K,Float64,P},
B::AbstractMutableFixedSizePaddedMatrix{K,N,Float64,K}
) where {M,K,N,P}
func = QuoteNode(Symbol(:mul_, P, :x, K, :times, K, :x, N))
quote
ccall(
($func, gfortlib), Cvoid,
(Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}),
C, A, B
)
end
end
@generated function gf_tc_mul!(
C::AbstractMutableFixedSizePaddedMatrix{M,N,Float64,P},
A::AbstractMutableFixedSizePaddedMatrix{M,K,Float64,P},
B::AbstractMutableFixedSizePaddedMatrix{K,N,Float64,K}
) where {M,K,N,P}
func = QuoteNode(Symbol(:mul_, P, :x, K, :times, K, :x, N))
quote
ccall(
($func, gforttclib), Cvoid,
(Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}),
C, A, B
)
end
end
function ifort_mul!(
C::AbstractMutableFixedSizePaddedMatrix{M,N,Float64,P},
A::AbstractMutableFixedSizePaddedMatrix{M,K,Float64,P},
B::AbstractMutableFixedSizePaddedMatrix{K,N,Float64,K}
) where {M,K,N,P}
ccall(
(:dgemmwrapped, ifortlib), Cvoid,
(Ptr{Cdouble}, Ptr{Cdouble}, Ptr{Cdouble}, Ref{Cint}, Ref{Cint}, Ref{Cint}, Ref{Cdouble}, Ref{Cdouble}),
C, A, B, Ref(Int32(P)), Ref(Int32(K)), Ref(Int32(N)), Ref(1.0), Ref(0.0)
)
end
Now, before looping over a range of tests, let’s confirm that each library gives us the same answer for an arbitrary size, and do an initial test for first impressions.
Specifying 7 rows in the matrix `A` below tends to make autovectorizers struggle, so I figure that is a good test case to highlight here. For that reason, I also manually specify the matrices’ sizes and disable padding.
The parameters in the curly braces `{ }` are:
- Number of rows
- Number of columns
- Element type
- Stride between columns.
If the stride between columns (parameter 4) equals the number of rows (parameter 1), then there is no extra padding between them. After initializing undefined matrices, we fill them with random numbers.
C₁ = MutableFixedSizePaddedMatrix{7,10,Float64,7}(undef);
A₁ = MutableFixedSizePaddedMatrix{7,32,Float64,7}(undef);
B₁ = MutableFixedSizePaddedMatrix{32,10,Float64,32}(undef);
randn!(A₁); randn!(B₁);
To multiply in place (overwriting the contents of `C`) using PaddedMatrices, we use `mul!` (exported by `LinearAlgebra`, although the methods for PaddedMatrices’ types are defined in PaddedMatrices itself).
The macro `@btime` runs the expression many times to get a reasonably reliable estimate of the minimum time.
Testing each of the libraries:
@btime mul!($C₁, $A₁, $B₁); C₁
fill!(C₁, 0); @btime eigen_mul!($C₁, $A₁, $B₁); C₁
fill!(C₁, 0); @btime blaze_mul!($C₁, $A₁, $B₁); C₁
fill!(C₁, 0); @btime gfort_mul!($C₁, $A₁, $B₁); C₁
fill!(C₁, 0); @btime gf_tc_mul!($C₁, $A₁, $B₁); C₁
fill!(C₁, 0); @btime ifort_mul!($C₁, $A₁, $B₁); C₁
Initial impressions: the Julia library is clearly the fastest, but Intel MKL’s JIT is not far behind. Everything else is many times slower.
Now, a script to perform all the benchmarks:
function sorted_median(x)
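# Assumes `x` is already sorted ascending; the benchmarking code below relies on
# the same assumption when it treats first(times) as the minimum.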
N = length(x)
N < 1 && return NaN
if isodd(N)
x[(N+1)>>1]
else
h = N >> 1
(x[h] + x[h+1]) / 2
end
end
function bench_matmul!(minimum, median, C, A, B)
times = (@benchmark eigen_mul!($C, $A, $B)).times
minimum[1] = first(times)
median[1] = sorted_median(times)
times = (@benchmark blaze_mul!($C, $A, $B)).times
minimum[2] = first(times)
median[2] = sorted_median(times)
times = (@benchmark gfort_mul!($C, $A, $B)).times
minimum[3] = first(times)
median[3] = sorted_median(times)
times = (@benchmark gf_tc_mul!($C, $A, $B)).times
minimum[4] = first(times)
median[4] = sorted_median(times)
times = (@benchmark ifort_mul!($C, $A, $B)).times
minimum[5] = first(times)
median[5] = sorted_median(times)
times = (@benchmark mul!($C, $A, $B)).times
minimum[6] = first(times)
median[6] = sorted_median(times)
end
function bench_range(Mrange, Krange, Nrange)
mr = Mrange isa Integer ? (Mrange:Mrange) : Mrange
kr = Krange isa Integer ? (Krange:Krange) : Krange
nr = Nrange isa Integer ? (Nrange:Nrange) : Nrange
minimum_times = OffsetArray{Float64}(undef, 1:6, mr, kr, nr, 1:2)
median_times = OffsetArray{Float64}(undef, 1:6, mr, kr, nr, 1:2)
for m ∈ Mrange, k ∈ Krange, n ∈ Nrange
padded_A = @Mutable randn(m, k)
padded_B = @Mutable randn(k, n)
padded_C = PaddedMatrices.MutableFixedSizePaddedMatrix{m,n,Float64}(undef)
@views bench_matmul!( # test with padding
minimum_times[:,m,k,n,1], median_times[:,m,k,n,1],
padded_C, padded_A, padded_B
)
C = PaddedMatrices.MutableFixedSizePaddedMatrix{m,n,Float64,m}(undef)
A = PaddedMatrices.MutableFixedSizePaddedMatrix{m,k,Float64,m}(undef)
B = PaddedMatrices.MutableFixedSizePaddedMatrix{k,n,Float64,k}(undef)
A .= padded_A; B .= padded_B
@views bench_matmul!( # test without padding
minimum_times[:,m,k,n,2], median_times[:,m,k,n,2], C, A, B
)
end
minimum_times, median_times
end
Now, unfortunately Blaze only supports aligned memory accesses (`vmovapd`) with its statically sized arrays. PaddedMatrices does not offer convenience functions for creating aligned matrices and only uses unaligned accesses (`vmovupd`).
The differences:
- On some old architectures `vmovapd` was faster than `vmovupd`, but on reasonably recent computers (i.e., AFAIK any desktop CPU from the last 5 years) they’re equally fast.
- If the memory is not aligned, `vmovapd` will crash the program, while `vmovupd` works equally quickly.
Memory being aligned means that the address is an integer multiple of the size of the vector being moved by the access.
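As a quick check of that definition (a hypothetical helper, assuming the 64-byte alignment a full 512-bit load or store wants):
# Hypothetical helper, not from PaddedMatrices: is the address a multiple of `bytes`?
is_aligned(p::Ptr, bytes = 64) = reinterpret(UInt, p) % bytes == 0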
So instead of using regular PaddedMatrix types, I allocate memory using `malloc`, and wrap the resulting pointers in `PtrMatrix` objects.
using PaddedMatrices: PtrMatrix
function bench_range(Mrange, Krange, Nrange)
mr = Mrange isa Integer ? (Mrange:Mrange) : Mrange
kr = Krange isa Integer ? (Krange:Krange) : Krange
nr = Nrange isa Integer ? (Nrange:Nrange) : Nrange
minimum_times = OffsetArray{Float64}(undef, 1:6, mr, kr, nr, 1:2)
median_times = OffsetArray{Float64}(undef, 1:6, mr, kr, nr, 1:2)
# max dims * bytes/element + 63
mem_per = (32^2)*8 + 63
ptr = Libc.malloc(3mem_per)
uptr = reinterpret(UInt, ptr)
# ptr_A, ptr_B, and ptr_C will be aligned on a 64 byte boundary
ptr_A = reinterpret(Ptr{Float64}, ( uptr + 63 ) & -64 )
ptr_B = reinterpret(Ptr{Float64}, ( uptr + mem_per + 63 ) & -64 )
ptr_C = reinterpret(Ptr{Float64}, ( uptr + 2mem_per + 63 ) & -64 )
for m ∈ Mrange, k ∈ Krange, n ∈ Nrange
# if m <= 4, then m is either 3 or 4 and we use a stride of 4; otherwise round the stride up to a multiple of 8
P = m > 4 ? (m + 7) & -8 : 4
padded_A = PtrMatrix{m,k,Float64,P,(P*k),true}(ptr_A)
padded_B = PtrMatrix{k,n,Float64,k,(k*n),true}(ptr_B)
padded_C = PtrMatrix{m,n,Float64,P,(P*n),true}(ptr_C)
randn!(padded_A); randn!(padded_B)
@views bench_matmul!( # test with padding
minimum_times[:,m,k,n,1], median_times[:,m,k,n,1],
padded_C, padded_A, padded_B
)
unpadded_A = PtrMatrix{m,k,Float64,m,(m*k),true}(ptr_A)
unpadded_B = PtrMatrix{k,n,Float64,k,(k*n),true}(ptr_B)
unpadded_C = PtrMatrix{m,n,Float64,m,(m*n),true}(ptr_C)
randn!(unpadded_A); randn!(unpadded_B)
@views bench_matmul!( # test without padding
minimum_times[:,m,k,n,2], median_times[:,m,k,n,2],
unpadded_C, unpadded_A, unpadded_B
)
end
Libc.free(ptr)
minimum_times, median_times
end
minimum_times, median_times = bench_range(Mrange, Krange, Nrange);
Some boilerplate to get things to print in the way I want:
using DataFrames
function string_to_ind(str)
if str == "eigen"
i = 1
elseif str == "blaze"
i = 2
elseif str == "gfortran"
i = 3
elseif str == "gfortran tc"
i = 4
elseif str == "mkl" || str == "mkl jit"
i = 5
elseif str == "native julia"
i = 6
else
throw("$str not recognized; please specify either \"eigen\", \"blaze\", \"mkl\", or \"native julia\".")
end
return i
end
function DataFrames.showrowindices(io::IO,
df::AbstractDataFrame,
rowindices::AbstractVector{Int},
maxwidths::Vector{Int},
leftcol::Int,
rightcol::Int,
rowid) # -> Void
rowmaxwidth = maxwidths[end]
for i in rowindices
# Print row ID
if rowid isa Nothing
DataFrames.@printf io "| %d" i + 2
else
DataFrames.@printf io "| %d" rowid
end
padding = rowmaxwidth - DataFrames.ndigits(rowid isa Nothing ? i + 2 : rowid)
for _ in 1:padding
write(io, ' ')
end
print(io, " | ")
# Print DataFrame entry
for j in leftcol:rightcol
strlen = 0
if DataFrames.isassigned(df[j], i)
s = df[i, j]
strlen = DataFrames.ourstrwidth(io, s)
if ismissing(s)
DataFrames.printstyled(io, s, color=:light_black)
elseif s === nothing
strlen = 0
else
DataFrames.ourshow(io, s)
end
else
strlen = DataFrames.ourstrwidth(io, Base.undef_ref_str)
DataFrames.ourshow(io, Base.undef_ref_str)
end
padding = maxwidths[j] - strlen
for _ in 1:padding
write(io, ' ')
end
if j == rightcol
if i == rowindices[end]
print(io, " |")
else
print(io, " |\n")
end
else
print(io, " | ")
end
end
end
return
end
function print_results(compare1, compare2, times; padded = false, sigdigits = 3)
ind1 = string_to_ind(compare1)
ind2 = string_to_ind(compare2)
padind = padded ? 1 : 2
times_parent = times.parent
df = round.(times_parent[ind1,:,1,:,padind] ./ times_parent[ind2,:,1,:,padind], sigdigits = sigdigits) |> DataFrame
offset_m = times.offsets[2]; offset_n = times.offsets[4]
print_results(df, offset_m, offset_n)
end
function print_results(df, offset_m, offset_n)
names!(df, [Symbol(:NCols,j+offset_n) for j ∈ 1:size(df,2)])
show(df, allrows = true, allcols = true)
end
Because I am most interested in the performance of `PaddedMatrices.jl`, I compare the relative runtimes of each of the other libraries with PaddedMatrices as the baseline.
First, taking a look at Eigen, which seems to be the most popular of the alternatives, employed for example in Stan and TensorFlow:
print_results("eigen", "native julia", minimum_times, padded = false)
What surprises me is that Eigen is still slow when $\textbf{A} \in \mathbb{R}^{8i\times32}$ for integer $i$. That is, whenever the number of rows is a multiple of 8.
I’m fairly sure gfortran mostly relies on the autovectorizer for vectorization, an idea supported by how much the generated code changes when you change compiler flags.
So for all of Eigen’s templating, all it seems to have to show for it (vs gfortran) is slow compilation times and fat binaries (although Blaze’s are the heaviest):
| Shared Library | File Size |
|---|---|
| Eigen | 1.3M |
| Blaze-Padded | 1.9M |
| Blaze-Unpadded | 2.7M |
| gfortran | 0.529M |
| gfortran-tc | 0.641M |
| ifort mkl-jit | 0.018M |
Of course, MKL-JIT’s file size is small because it generates the matmul code “just in time” while running, instead of ahead of time as in the other examples.
Comparing Eigen directly against gfortran (times are the ratio of Eigen runtimes to gfortran runtimes):
print_results("eigen", "gfortran tc", minimum_times, padded = false)
gfortran did better for small matrices, and Eigen for large matrices. If you’re dealing with larger matrices in Fortran, you’d probably just link a BLAS library; the compiler will then turn `matmul` calls into the appropriate calls from the linked library.
Blaze also did worse than gfortran when $M$, the number of rows in matrices $\textbf{A}$ and $\textbf{C}$, was 4 or less. With larger $M$, Blaze performed much better, especially when $\textbf{A}$ and $\textbf{C}$ had 16 rows:
print_results("blaze", "gfortran tc", minimum_times, padded = false)
We’ll note later that disabling `tree-cunrolli` improves gfortran’s code a lot for the case of 16 rows, but not by the nearly 9x advantage Blaze has here.
Now, let’s take a look at Blaze. As we already saw from its performance relative to gfortran, it also did better than Eigen relative to unpadded PaddedMatrices. Despite padding being the namesake of PaddedMatrices, PaddedMatrices does not depend on padding to generate fast code, while Blaze apparently does.
print_results("blaze", "native julia", minimum_times, padded = false)
Blaze was faster when $M=8$ and $N=3$ or $N=4$, but slower otherwise.
Both gfortran versions were comparatively slow; at least they were fast to compile (and let you link alternative libraries).
print_results("gfortran", "native julia", minimum_times, padded = false)
print_results("gfortran tc", "native julia", minimum_times, padded = false)
More interesting than comparing them with Julia is looking at the impact of the tree-cunrolli compiler pass; numbers greater than 1 mean the code was faster without the pass, and numbers less than 1 mean the pass produced better code:
print_results("gfortran tc", "gfortran", minimum_times, padded = false)
When $M$ was 8 or less, the pass helped. It seemed to make things a little slower when there were between 9 and 15 rows, much slower for 16 rows, and didn’t make a difference beyond that.
Finally, let’s take a look at MKL JIT:
print_results("mkl jit", "native julia", minimum_times, padded = false)
While still well behind unpadded PaddedMatrices, it takes an easy second place for the case where we do not have padding. It was actually slightly (negligibly?) faster for the cases of $M=26,N=31$ and $M=29,N=31$ (taking 2% and 0.03% less time), and within 20% for many of the remaining times.
Now, let us take a look at the padded case. Because Eigen didn’t do particularly well when we had multiples of 8 or 16 rows, we’d naturally guess that padding wouldn’t do it much good:
print_results("eigen", "native julia", minimum_times, padded = true)
Padding did not do it much good.
Blaze, on the other hand, implements automatic padding. So between that (meaning we expect it to be optimized for $M$ being a multiple of 8) and its good performance earlier when $M=8$, we expect it to do well:
print_results("blaze", "native julia", minimum_times, padded = true)
Now it beat PaddedMatrices when $M\leq8$ (and $M\neq4$), and $N=3$ or $N=4$. I’ll look into that later to improve performance for that case.
Other than this case, Blaze comes close but lags behind.
gfortran did worse than Julia across the board, so we’re not expecting any changes there.
The only thing I’m looking for is gfortran with `tree-cunrolli` doing better when $M\leq8$, and gfortran without it doing better when $9\leq M\leq16$. Beyond that, they should perform similarly.
print_results("gfortran", "native julia", minimum_times, padded = true)
print_results("gfortran tc", "native julia", minimum_times, padded = true)
MKL JIT is still not very far behind PaddedMatrices:
print_results("mkl jit", "native julia", minimum_times, padded = true)
It is now behind Blaze for most of the smaller sizes, but when $M \geq 17$ and $N \geq 7$ (or when $M=4$), MKL JIT has the edge:
print_results("mkl jit", "blaze", minimum_times, padded = true)
All in all, PaddedMatrices is clearly the fastest over this range of matrix sizes. If your matrices can be padded, or you’re working in C++ and want to be able to compile into an executable or shared library, Blaze is a great choice.
MKL JIT also did well (and has C bindings), and like PaddedMatrices, the JIT doesn’t force you to specify sizes ahead of time.
While padding made a large difference for a few of the tested libraries, it actually made little difference to PaddedMatrices itself:
pad_test = round.(minimum_times[end,:,32,:,1] ./ minimum_times[end,:,32,:,2], sigdigits = 3).parent |>
DataFrame
print_results(pad_test, 2, 2)
These are ratios of padded vs unpadded runtimes. As most numbers are less than 1, padding did provide a slight performance advantage. Padding’s advantage is likely to be bigger on pre-AVX-512 architectures, because the unpadded AVX-512 code uses bitmasks to handle row counts that are not integer multiples of the vector width. E.g., when the matrices are padded and $M = 14$, there will be two rows of padding, so that we can pretend $M=16$ while vectorizing the code. When matrices are not padded, PaddedMatrices will instead “mask” the last two elements of a vector, which achieves the same effect.
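For illustration, here is a hypothetical sketch (not PaddedMatrices’ internals) of how a row count maps to a remainder mask, again assuming 8 `Float64`s per vector:
# Hypothetical sketch: bitmask for the final, possibly partial, vector of a column.
function remainder_mask(M, W = 8)
    r = M % W
    r == 0 ? 0xff : UInt8((1 << r) - 1)  # all lanes active when W divides M evenly
end
remainder_mask(14)  # 0x3f: the last vector touches only 6 of its 8 lanes
remainder_mask(16)  # 0xff: no masking needed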
I’ve found the advantages of not using padding are:
- It lets you efficiently operate on larger matrices blockwise. You can’t treat those blockwise pieces as padded, or you’ll end up overwriting parts of the full matrix that are outside of your submatrix!
- It is easier to interface with other libraries, because many won’t be using matrices with padding.
- You need masking anyway when performing reductions (e.g., summing all values of the matrix), because you don’t want the junk in the padding to contaminate results.
But there are many situations where these don’t apply, so I do think padding is a great approach in general.
For all my above comparisons I used minimum times instead of medians. Many argue (and I agree) that minimum runtimes are less biased when you don’t have a running garbage collector (Julia’s garbage collector wasn’t being triggered, because we weren’t using Julia to allocate memory while multiplying our matrices).
If you do have a garbage collector, that is consistent noise which I think should be averaged in. Other sources of noise, however, are random events that only add to the time. Regardless, median times were hardly higher than minimum times:
pad_test = round.(median_times[end,:,32,:,2] ./ minimum_times[end,:,32,:,2], sigdigits = 3).parent |>
DataFrame
print_results(pad_test, 2, 2)