GibbsVectorizationIntro

Before we try to hand-optimize the Gibbs model, let’s talk some about how to actually optimize code. Or at least hit a few basics.

Here I wanted to write a short introduction to writing fast code.

For code to run fast, there are a few basic principles:

1. More efficient algorithm. This is where the biggest wins are possible!
2. Don’t compute anything you don’t have to. Don’t compute anything more than once (unless it’s faster/cheaper to re-compute than to store and look up).
3. Parallelism. CPU clock speeds have largely stopped increasing in recent years. Instead, recent increases in CPU throughput have come in the form of parallelism. Parallelism takes many forms, the most obvious which is multiple separate CPU cores. Another major source of speed up is SIMD (Single Instruction Multiple Data), where a single instruction from one CPU core can act on multiple pieces of data simultaneously. Most recent CPUs have 256 bit registers (where the data is held while the CPU can operate on it). A standard double precision number is 64 bits, meaning these CPUs can potentially operate on 4 numbers at a time. A few CPUs have already upgraded to 512 bit registers, potentially acting on 8 double precision numbers or 16 single precision numbers per opertion.
4. Memory is slow! Consider this benchmark:
In :
using BenchmarkTools, SIMDArrays
A = @Static randn(8,13)
B = @Static randn(13,10)
@benchmark foreach(_ -> $A *$B, 1:1000)

Out:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     18.761 μs (0.00% GC)
median time:      19.007 μs (0.00% GC)
mean time:        19.051 μs (0.00% GC)
maximum time:     55.397 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1
In :
# 299_792_458 speed of light in m/s * 25 ns * 1 s / 1e9 ns = meters light travels
299_792_458  * 19 * 1 / 1e9

Out:
5.696056702

I repeated the matrix multiplication benchmark 1000 times to get more accurate timings.

Light travels 299,792,458 meters per second. That means that light can only travel about 5.7 meters in the time it takes to multiply an $8 \times 12$ and a $12 \times 10$ matrix.

That computation requires taking $8 * 10 = 80$ dot products with vectors of length 13. Each of these dot products requires 13 multiplications and 12 additions. That means that a single CPU core did 1040 multiplications and 960 additions in the time it took light to travel 5.7 meters, less than the width of a typical office.

Each core of a computer can churn through data blisteringly fast.
The CPU’s speed is battling the laws of physics. But memory is much slower. Figuring out where the memory actually is that we’d like to access, and moving it to the CPU is extremely slow compared to how fast the CPU is at handling memory that is actually there. RAM can be far away.

An example of where this can be a problem, consider a simple dot product.

In :
using LinearAlgebra
x = rand(128); y = rand(128);
@btime foreach(_ -> $x'*$y, 1:10^6)

  14.562 ms (0 allocations: 0 bytes)

In :
x = rand(128 * 10^6); y = rand(128 * 10^6);
@btime $x'*$y

  154.016 ms (0 allocations: 0 bytes)

Out:
3.200391627717342e7
In :
156.32 / 14.843

Out:
10.531563700060634

Multiplying the length of a vector by a million makes the dot product more than 10 million times slower!
More or less, that means for every microsecond the CPU spent working, it spent at least 9 microseconds just waiting for data to show up!

I won’t focus on this much in this post, but it’s something to be mindful of while trying to write fast code.
Try not to jump around in memory much. CPUs have different levels of cache memory. These caches are much smaller than RAM, but they’re much faster. Typically, L3 is large and shared among cores, while each core gets their own L2 and L1 cache. The L1 cache is the smallest, but also the closest to the action and the fastest.
On my computer, each core has 32 KiB of L1 data cache, meaning it can store roughly 32 KiB * 1024 B/KiB / (8 B / 64-bit float) = 4096 64 bit floats in the L1 cache. Breaking bigger problems down and processing them in chunks small enough to fit in the L1 cache, eg using less than 4096 doubles at a time, will be faster.

Many optimized libraries and aglorithms do this. For example, optimized matrix multiplication routines do this for via multiplying big matrices block-wise, where these blocks are small enough to fit in the L1 cache.

My focus here will be on the SIMD (Single Instruction Multiple Data) instructions. This is also often known as vectorization, because each instruction operates on a “vector” of data. This can be confusing for those coming from R, MATLAB, or Python, where “vectorization” often means “not using a for loop”.
Vectorizing in languages with great compilers, such as Julia, C++, or Fortran, means having the compiled generate SIMD instructions.
Many “vectorized” functions from the former sets of languages will also often really be vectorized in the sense of the latter, because they’re likely to be calling compiled C/C++/Fortran code.

Within the compiled languages, “for loops” often are vectorized. For example,

In :
function my_dot_product(a, b)
d = 0.0
@inbounds @simd for i ∈ eachindex(a, b)
d += a[i] * b[i]
end
d
end
x = rand(128); y = rand(128)
@btime $x' *$y

  16.519 ns (0 allocations: 0 bytes)

Out:
34.05424279632278
In :
@btime my_dot_product($x,$y)

  7.090 ns (0 allocations: 0 bytes)

Out:
34.054242796322775

We get the same answer as the “builtin” bar rounding error, but our function is actually faster! That’s because the base dot product actually calls an external BLAS library instead of native Julia code. The advantage of the external library is that it is multithreaded (so long as we don’t set BLAS.set_num_threads(1)), which speeds it up for really big vectors. Memory throughput should be higher when there’s more cores.
For vectors too small to benefit, it’s faster not to call an external library.

We can look at the assembly instructions of my_dot_product with @code_native.
When looking for vecorization, the things to look for are vector registers. xmm registers are 128 bits, ymm registers are 256 bits, and zmm registers are 512 bits. An instruction will often look like this:

vfmadd231pd (%rdx,%rdi,8), %zmm4, %zmm0


the instruction is named vfmadd231. The fmadd says what it is doing. fmadd means fused-multiply-add, meaning it is doing both an addition and a multiplication. add does addition, sub subtraction, mul multiplication, div division, etc. mov means to move memory from one place to another.

The second last letter, p, means “packed”, meaning that it is operating on a vector “packed” with numbers. It could alternatively say s, which means it is only operating on a single number, even if the register could hold an entire vector.
The d means “double precision”, meaning the operations is multiplying and adding double precision (64 bit) floating point numbers. Remember that the zmm register holds 512 bits — that’s just 512 zeros/ones. That last letter of the instruction specifies how exactly those zeros and ones are being interpreted. q means integer, and s means single precision.

This instruction has 3 inputs (but many have only 1 or 2): it is multiplying numbers from a region in memory (%rdx,%rdi,8) with the register zmm4, and adding the result to zmm0.
In AT&T syntax (which this is), the answer gets stored on the right most of the registers that follow, meaning zmm0 gets overwritten with the sum of itself, and the product of (%rdx,%rdi,8) and %zmm4. That is:

%zmm0 = (%rdx,%rdi,8) * %zmm4 + %zmm0

Because zmm registers hold 8 doubles, this single instruction does 8 multiplications and 8 additions.

In :
@code_native my_dot_product(x, y)

	.text
; ┌ @ In:3 within my_dot_product'
; │┌ @ simdloop.jl:65 within macro expansion'
; ││┌ @ abstractarray.jl:209 within eachindex' @ abstractarray.jl:219 @ abstractarray.jl:216
; │││┌ @ abstractarray.jl:95 within axes1'
; ││││┌ @ abstractarray.jl:75 within axes'
; │││││┌ @ In:2 within size'
subq	$40, %rsp movq 24(%rdi), %rcx ; │││││└ ; │││││┌ @ tuple.jl:125 within map' ; ││││││┌ @ range.jl:318 within Type' @ range.jl:309 ; │││││││┌ @ promotion.jl:414 within max' movq %rcx, %rax sarq$63, %rax
andnq	%rcx, %rax, %rax
; │││└└└└└
; │││ @ abstractarray.jl:209 within eachindex' @ array.jl:155
movq	24(%rsi), %r8
; │││ @ abstractarray.jl:209 within eachindex' @ abstractarray.jl:220
; │││┌ @ abstractarray.jl:226 within _all_match_first'
; ││││┌ @ abstractarray.jl:220 within #91'
; │││││┌ @ abstractarray.jl:216 within eachindex'
; ││││││┌ @ abstractarray.jl:95 within axes1'
; │││││││┌ @ abstractarray.jl:75 within axes'
; ││││││││┌ @ tuple.jl:125 within map'
; │││││││││┌ @ range.jl:318 within Type' @ range.jl:309
; ││││││││││┌ @ promotion.jl:414 within max'
movq	%r8, %rdx
sarq	$63, %rdx andnq %r8, %rdx, %rdx ; ││││└└└└└└└ ; ││││┌ @ range.jl:718 within ==' @ promotion.jl:403 cmpq %rdx, %rax ; │││└└ jne L300 ; │││ @ abstractarray.jl:209 within eachindex' @ abstractarray.jl:219 @ abstractarray.jl:216 ; │││┌ @ abstractarray.jl:95 within axes1' ; ││││┌ @ abstractarray.jl:75 within axes' ; │││││┌ @ tuple.jl:125 within map' ; ││││││┌ @ range.jl:318 within Type' @ range.jl:309 ; │││││││┌ @ promotion.jl:414 within max' testq %rcx, %rcx ; ││└└└└└└ ; ││ @ simdloop.jl:68 within macro expansion' jle L73 movq (%rdi), %rcx movq (%rsi), %rdx ; ││ @ simdloop.jl:71 within macro expansion' cmpq$32, %rax
jae	L85
; └└
; ┌ @ simdloop.jl within my_dot_product'
vxorpd	%xmm0, %xmm0, %xmm0
xorl	%esi, %esi
jmp	L272
L73:
vxorps	%xmm0, %xmm0, %xmm0
; └
; ┌ @ In:6 within my_dot_product'
addq	$40, %rsp vzeroupper retq ; │ @ In:3 within my_dot_product' ; │┌ @ simdloop.jl:71 within macro expansion' L85: movabsq$9223372036854775776, %rsi # imm = 0x7FFFFFFFFFFFFFE0
andq	%rax, %rsi
vxorpd	%xmm0, %xmm0, %xmm0
xorl	%edi, %edi
vxorpd	%xmm1, %xmm1, %xmm1
vxorpd	%xmm2, %xmm2, %xmm2
vxorpd	%xmm3, %xmm3, %xmm3
nopw	%cs:(%rax,%rax)
; ││ @ simdloop.jl:73 within macro expansion' @ In:4
; ││┌ @ array.jl:729 within getindex'
L128:
vmovupd	(%rcx,%rdi,8), %zmm4
vmovupd	64(%rcx,%rdi,8), %zmm5
vmovupd	128(%rcx,%rdi,8), %zmm6
vmovupd	192(%rcx,%rdi,8), %zmm7
; ││└
; ││┌ @ float.jl:395 within +'
; │└└
; │┌ @ int.jl:53 within macro expansion'
addq	$32, %rdi cmpq %rdi, %rsi jne L128 ; │└ ; │┌ @ simdloop.jl:73 within macro expansion' @ In:4 ; ││┌ @ float.jl:395 within +' vaddpd %zmm0, %zmm1, %zmm0 vaddpd %zmm0, %zmm2, %zmm0 vaddpd %zmm0, %zmm3, %zmm0 vextractf64x4$1, %zmm0, %ymm1
vextractf128	$1, %ymm0, %xmm1 vaddpd %zmm1, %zmm0, %zmm0 vpermilpd$1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
cmpq	%rsi, %rax
; └└└
; ┌ @ simdloop.jl:71 within my_dot_product'
je	L292
nopw	%cs:(%rax,%rax)
; └
; ┌ @ In:3 within my_dot_product'
; │┌ @ simdloop.jl:73 within macro expansion' @ In:4
; ││┌ @ array.jl:729 within getindex'
L272:
vmovsd	(%rcx,%rsi,8), %xmm1    # xmm1 = mem,zero
; ││└
; ││┌ @ float.jl:395 within +'
; ││└
; ││ @ simdloop.jl:74 within macro expansion'
; ││┌ @ int.jl:53 within +'
addq	$1, %rsi ; ││└ ; ││ @ simdloop.jl:71 within macro expansion' ; ││┌ @ int.jl:49 within <' cmpq %rax, %rsi ; ││└ jb L272 ; │└ ; │ @ In:6 within my_dot_product' L292: addq$40, %rsp
vzeroupper
retq
; │ @ In:3 within my_dot_product'
; │┌ @ simdloop.jl:65 within macro expansion'
; ││┌ @ abstractarray.jl:209 within eachindex' @ abstractarray.jl:220
L300:
movabsq	$jl_system_image_data, %rax movq %rax, 8(%rsp) movabsq$jl_system_image_data, %rax
movq	%rax, 16(%rsp)
movq	%rdi, 24(%rsp)
movq	%rsi, 32(%rsp)
movabsq	$jl_invoke, %rax movabsq$140261687004432, %rdi  # imm = 0x7F9138078D10
leaq	8(%rsp), %rsi
movl	$4, %edx callq *%rax ud2 nopw %cs:(%rax,%rax) ; └└└  That is a lot of output, but the most important part: L128: vmovupd (%rcx,%rdi,8), %zmm4 vmovupd 64(%rcx,%rdi,8), %zmm5 vmovupd 128(%rcx,%rdi,8), %zmm6 vmovupd 192(%rcx,%rdi,8), %zmm7 ; ││└ ; ││┌ @ float.jl:395 within +' vfmadd231pd (%rdx,%rdi,8), %zmm4, %zmm0 vfmadd231pd 64(%rdx,%rdi,8), %zmm5, %zmm1 vfmadd231pd 128(%rdx,%rdi,8), %zmm6, %zmm2 vfmadd231pd 192(%rdx,%rdi,8), %zmm7, %zmm3 ; │└└ ; │┌ @ int.jl:53 within macro expansion' addq$32, %rdi
cmpq    %rdi, %rsi
jne L128


The vmovupd instruction loads data from memory into a register. Once these are loaded, it uses the vfmadd instructions to multiply them with more numbers from a register, and update cumulative sums.
It does each of these things four times total. Meaning it loads does 32 multiplications and 32 additions.

Then it increments a loop counter (\%rdi) with addq, then cmpq (compare-integers) to check if the loop is over, and finally conditionally jumps (jne) back to L128 (location marked 128) or continues on in the code. L128 is the start of the loop, just before memory was loaded. Meaning if it jumps back, the loop, loops.

So hold on. We wrote a for loop, where each iteration we said get 1 element from both arrays, multiply the two of tem, and add them to the cumulative sum. But this assembly code is instead saying, do that to 32 numbers per loop iteration!
That’s not what our code said!

Tricks like this are a part of why languages like C++ are fast: the compiler sees what you wrote, but then tells the computer to do something that will get the same answer, but is much faster.

Actually, that is not quite true here. Floating point arithmetic is not associative. That means that $(a + b) + c$ does not not necessarilly equal $a + (b + c)$, due to rounding errors.

For example:

In :
1e18 + 65.0 - 1e18 - 65.0

Out:
63.0
In :
foo(a, b) = a + b - a - b
foo(1e18, 65)

Out:
63.0
In :
foo(1e18, 63)

Out:
-63.0

You’re likely to get the same result in any language. Why?

Because the machine error at a number as massive as 1e18 is 128.

In :
eps(1e18)

Out:
128.0

This means that the difference between 1e18 and the next two closest numbers 64 bit floats can represent is 128.

In :
nextfloat(1e18) - 1e18

Out:
128.0
In :
1e18 - prevfloat(1e18)

Out:
128.0

Therefore, 1e18 + 65 rounds to 1e18 + 128. Then when we subtract 1e18, we get 128. Subtract 65, and we have 63.

Similarly, 1e18 + 63 rounds to 1e18. Subtract 1e18, and we have 0. Subtract 63, and we have -63.

Because associativity is not true for floating point numbers, compilers will not, by default, apply any associative-transforms to reorder floating point math. When in doubt, it is safer to assume you meant exactly what you wrote in the way you wrote it.

It would also be confusing if turning on compiler optimizations by itself suddenly changed the answer you got! Rounding errors are not always trivial. $\pm 63$ is not trivial!

An example of an algorithm written with the (lack of) associativity in mind, see Kahan summation. Kahan summation is a numerically accurate summation algorithm. It works by, while summing up your vector, also calculating and tracking the numerical erorr of your sum, and repeatedly adding this numerical eror back onto the sum to correct it.
If a compiler were allowed to assume that floating point math is associative, it would realize that the numerical error is (given associativity!) zero, and therefore delete all your code handling error. Then, suddenly, your accurate summation algorithm is a regular summation algorithm.

In our dot product, however, the compiler did assume associativity, and reorder our computations.
It didn’t calculate one product, add it to a single cumulative sum, and repeat. It calculated 32 products at a time, accumulating 32 separate sums. And only at the end, did it finally add these 32 sums together (plus add a second loop to cover any straglers, because most vectors’ length’s aren’t a product of 32).
While that method is actually more accurate than simply doing it in order with a single sum. But it isn’t what we wrote, and the compiler can’t know on it’s own that we were not intending to do something fancy like Kahan summation.
So we have to give the compiler permission to optimize, telling it that isn’t what we were doing. That is what the @simd in front of our for loop does.

Another macro that does this and applies to more than for loops is @fastmath:

In :
fast_foo(a, b) = @fastmath a + b - a - b
fast_foo(1e18, 65.0)

Out:
0.0
In :
@code_native foo(1e18, 65.0)

	.text
; ┌ @ In:1 within foo'
; │┌ @ In:1 within +'
; │└
; │┌ @ float.jl:397 within -'
vsubsd	%xmm0, %xmm2, %xmm0
vsubsd	%xmm1, %xmm0, %xmm0
; │└
retq
nopl	(%rax)
; └

In :
@code_native fast_foo(1e18, 65.0)

	.text
; ┌ @ In:1 within fast_foo'
vxorps	%xmm0, %xmm0, %xmm0
retq
nopw	%cs:(%rax,%rax)
; └


the vxorps instruction does the xor operation. xor means, “or, but not and”. xor-ing anything with itself always equals zero because everything is “and” themselves. So while foo takes us literally and does exactly what we asked for, fast_foo did the algebra, relealized everything cancels out, and realized all it has to do is spit out “0”.

Also, integer arithmetic, unlike floating point arithmetic, is associative. So no promises needed:

In :
@code_native foo(10^18, 65)

	.text
; ┌ @ In:1 within foo'
xorl	%eax, %eax
retq
nopw	%cs:(%rax,%rax)
; └


Our regular, non-fastmath foo, when compiling a version for integers, also just xors to spit out a zero.

Another reason these compiled languages are faster is “2.” from our list: not doing anything you don’t have to, and just doing the things you need to once. eg, sorting out all the administrative details of what the input data is, and then what happens to it.
Figuring out what to do on the fly can take many times longer than just doing it, if “it” is fast.
In interpreted languages, like R and Python, each variable is a reference to data. Then when you want to do something with those variables, the interpreter has to “interpret” the meanings of everything as it happens, figuring out what the variables are, what it’s supposed to do with them, etc. We can emulate this from Julia like this:

In :
x = Ref(1.0); y = Ref(1.0)
@btime x[] + y[]

  34.876 ns (3 allocations: 48 bytes)

Out:
2.0

The most important factor of the above benchmark was leaving off the $ I used earlier each time I benchamrked. The $ lets the benchmark know the types of the variables I am benchmarking. This would let it emulate calling the benchmarked expression from within a function. Inside a function, Julia should (normally) be aware of the types of all the variables, which lets it sort out the sequence of computations it actually has to do ahead of time.

Now, 30-something nanoseconds isn’t bad on its own.
If R spends a 100ns interpreting things when you call rstan, who cares?

If all you’re doing is calling big functions or “vectorized” routines written in other languages, performance is fine. Actually, if that is all you’re doing, the interpreter is likely to spend a lot less time figuring out which already-compiled C code it should be running than Julia will compiling code before it gets to run it.
Try comparing how long it takes a fresh R and Julia session to load a plotting package and spit out the first plot.

The problem becomes apparent when you want to write fast routines yourself, building them out of lots of little instructions.
Imagine if, in our matrix multiplication, our 1040 multiplications and 960 additions each took 30+ ns. We certainly couldn’t have done it all in less than 20 ns!

Or, see what happens to a for loop:

In :
x = rand(128); y = rand(128)
@btime (d = 0; for i ∈ eachindex(x, y); d += x[i] * y[i]; end; d)

  9.736 μs (640 allocations: 11.98 KiB)

Out:
31.624253451226192
In :
@btime my_dot_product($x,$y)

  9.379 ns (0 allocations: 0 bytes)

Out:
31.624253451226178

Scaling this up, you can imagine: if some Julia code takes 1 hour to run, it could take a month to run interpreted.
In cases like that, who cares about compiling for a second?

This can be important if you’d like to run length simulations, without being constrained to using tools that already exist. How often is the best tool for the job something that’s already there and available?
Of course, other tools like Rcpp exist to do this from within R, and popular packages like JAGS/NIMBLE and rstan will translate a DSL (domain-specific language) to C++ and compile it.

There are a few tricks to keep in mind when trying to get code to vectorize.

So let’s walk through an example I had fun with:

In :
@inline function pdforwardsolve(x1,x2,x3,S11,S12,S22,S13,S23,S33)
@fastmath begin
R11 = 1 / sqrt(S11)
R11x1 = R11 * x1
L21 = R11 * S12
L31 = R11 * S13
R22 = 1 / sqrt(S22 - L21*L21)
L32 = R22 * (S23 - L21 * L31)
R33 = 1 / sqrt(S33 - L31*L31 - L32*L32)

nR21x1 = R22 * L21 * R11x1
R31x1 = R33 * ( L32*nR21x1 - L31*R11x1 )
nR32 = R33 * L32 * R22

(
R11x1,
R22*x2 - nR21x1,
R31x1 - nR32*x2 + R33*x3
)
end
end

Out:
pdforwardsolve (generic function with 1 method)

This function calculates:

\begin{align*}
\textbf{L} &= \text{chol}(\textbf{S})\\
\textbf{y} &= \textbf{L}^{-1}\textbf{x},
\end{align*}

where $\textbf{S}$ is a $3\times 3$ symmetric and positive definite matrix.

Our data consists of a large number of vectors $\textbf{x}$ and matrices $\textbf{S}$, and we want to solve for $\textbf{y}$ for each of them. So we want a function to calculate this for every element.

The most important trick to remember for vectorizing code is that CPUs can efficiently load or operate on contiguous data, and “under the hood”, all data is stored linearly as a vector. That means in a matrix like:

In :
reshape(collect(1:12), (4,3))

Out:
4×3 Array{Int64,2}:
1  5   9
2  6  10
3  7  11
4  8  12

The 4 and 5 are next to each other in memory, while the 1 and 5, despite appearing next to each other in the matrix, are 4 appart. This is called column-major storage, because data within a column are next to one another.

Row-major storage is the opposite: data within a row are next to each other.
So, while trying to get a function to vectorize, we need to ask:

1. What pattern would let us do multiple computations simultaneously?
2. How then will we have to organize our data to let that happen?

Now, the standard advice you see online is, because pdforwardsolve_fm operates on one vector and one matrix at a time, we should store our data so that the data in each of these are close together. For example:

Accessing the memory in the order in which it is stored will be faster. In Fortran, this means writing loops like

do j = 1, n
do i = 1, n
! operate on A(i,j) ...
end do
end do


On the other hand, in C we want:

for (i = 0; i < n; i++ {
for (j = 0; j < n; j++) {
/* operate on A[i][j] ... */
}
}


Or if you want something official looking, from MATLAB’s documentation:

Code segment 2 is faster because it traverses the elements of the 2-D array by going down the columns in the inner loop. If the algorithm permits, you can also maximize cache efficiency by using the single index method, x(k), instead of x(r,c).

So, lets try that. First, some code to generate a fake data set.

In :
using StaticArrays, Random, BenchmarkTools
N = 2827
Y32 = Matrix{Float32}(undef, N, 3);
BPP32 = Matrix{Float32}(undef, N, 10);
@inline function randpd(::Type{T} = Float64, ::Val{N} = Val(3), ::Val{R} = Val(N)) where {T,N,R}
(@SMatrix randn(T, 2N, N)) |> x -> x' * x;
end

@inline function randpdv(::Type{T}) where T
sS = randpd(T, Val(3))
@inbounds SVector(sS, sS, sS, sS, sS, sS)
end

function fillpd!(X::AbstractMatrix{T}) where T
@inbounds for i ∈ 1:size(X,1)
X[i,:] = randpdv(T)
end
end
randn!(@view BPP32[:,1:4]);
fillpd!(@view BPP32[:,5:10])

BPP = copy(BPP32')
Y = Matrix{Float32}(undef, 3, N)

function juliatest_1!(Y::AbstractMatrix, BPP::AbstractMatrix)
@inbounds for i ∈ 1:size(BPP,2)
Y[:,i] .= pdforwardsolve(
BPP[1,i],BPP[2,i],BPP[3,i],
BPP[5,i],BPP[6,i],BPP[7,i],BPP[8,i],BPP[9,i],BPP[10,i]
)
end
end

@benchmark juliatest_1!($Y,$BPP)

Out:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     18.924 μs (0.00% GC)
median time:      19.005 μs (0.00% GC)
mean time:        19.039 μs (0.00% GC)
maximum time:     34.947 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     1

Is 19 microseconds good? Hard to know without a baseline, but let’s take a quick glance at the assembly.

In :
@code_native juliatest_1!(Y, BPP)

	.text
; ┌ @ In:27 within juliatest_1!'
pushq	%rbp
pushq	%r15
pushq	%r14
pushq	%r13
pushq	%r12
pushq	%rbx
subq	$24, %rsp movq %rsi, 16(%rsp) movq 8(%rsi), %rax ; │┌ @ array.jl:154 within size' movq 32(%rax), %r12 ; │└ ; │┌ @ range.jl:5 within Colon' ; ││┌ @ range.jl:275 within Type' ; │││┌ @ range.jl:280 within unitrange_last' ; ││││┌ @ operators.jl:333 within >=' ; │││││┌ @ int.jl:428 within <=' testq %r12, %r12 ; │└└└└└ jle L479 movq (%rsi), %rcx movq 24(%rcx), %r14 movq (%rax), %rbx movq 24(%rax), %rbp testq %r14, %r14 ; │ @ In:27 within juliatest_1!' jle L298 movq (%rcx), %r13 ; │ @ In:27 within juliatest_1!' shlq$2, %r14
addq	$36, %rbx shlq$2, %rbp
movabsq	$140262029000352, %rax # imm = 0x7F914C69FEA0 vmovss (%rax), %xmm0 # xmm0 = mem,zero,zero,zero vmovss %xmm0, 12(%rsp) movabsq$__memcpy_avx_unaligned_erms, %r15
nopl	(%rax,%rax)
; │ @ In:28 within juliatest_1!'
; │┌ @ array.jl:730 within getindex'
L112:
vmovss	-32(%rbx), %xmm8        # xmm8 = mem,zero,zero,zero
vmovss	-20(%rbx), %xmm1        # xmm1 = mem,zero,zero,zero
; │└
; │┌ @ In:3 within pdforwardsolve'
; ││┌ @ fastmath.jl:271 within sqrt_fast'
vsqrtss	%xmm1, %xmm1, %xmm1
; │└└
; │┌ @ array.jl:730 within getindex'
vmovss	(%rbx), %xmm2           # xmm2 = mem,zero,zero,zero
vmovss	12(%rsp), %xmm0         # xmm0 = mem,zero,zero,zero
; │└
; │┌ @ In:3 within pdforwardsolve'
; ││┌ @ fastmath.jl:254 within div_fast' @ fastmath.jl:164
vdivss	%xmm1, %xmm0, %xmm1
; ││└
; ││ @ In:4 within pdforwardsolve'
; ││┌ @ fastmath.jl:163 within mul_fast'
vmulss	-36(%rbx), %xmm1, %xmm3
; │└└
; │┌ @ fastmath.jl:163 within pdforwardsolve'
vmulss	-16(%rbx), %xmm1, %xmm4
; │└
; │┌ @ In:6 within pdforwardsolve'
; ││┌ @ fastmath.jl:163 within mul_fast'
vmulss	-8(%rbx), %xmm1, %xmm1
vmovss	-12(%rbx), %xmm5        # xmm5 = mem,zero,zero,zero
; │└└
; │┌ @ fastmath.jl:162 within pdforwardsolve'
; │└
; │┌ @ In:7 within pdforwardsolve'
; ││┌ @ fastmath.jl:271 within sqrt_fast'
vsqrtss	%xmm5, %xmm5, %xmm5
; │└└
; │┌ @ fastmath.jl:164 within pdforwardsolve'
vdivss	%xmm5, %xmm0, %xmm5
; │└
; │┌ @ In:11 within pdforwardsolve'
; ││┌ @ fastmath.jl:169 within mul_fast' @ fastmath.jl:163
vmulss	%xmm3, %xmm4, %xmm6
; │└└
; │┌ @ fastmath.jl:162 within pdforwardsolve'
; │└
; │┌ @ In:8 within pdforwardsolve'
; ││┌ @ fastmath.jl:163 within mul_fast'
vmulss	%xmm5, %xmm4, %xmm4
vmulss	%xmm1, %xmm1, %xmm7
vsubss	%xmm7, %xmm2, %xmm2
; │└└
; │┌ @ fastmath.jl:271 within pdforwardsolve'
vsqrtss	%xmm2, %xmm2, %xmm2
; │└
; │┌ @ In:9 within pdforwardsolve'
; ││┌ @ fastmath.jl:254 within div_fast' @ fastmath.jl:164
vdivss	%xmm2, %xmm0, %xmm2
; │└└
; │┌ @ fastmath.jl:163 within pdforwardsolve'
vmulss	%xmm5, %xmm6, %xmm7
; │└
; │┌ @ In:12 within pdforwardsolve'
; ││┌ @ fastmath.jl:163 within mul_fast'
vmulss	%xmm3, %xmm1, %xmm1
; ││└
; ││┌ @ fastmath.jl:162 within sub_fast'
vfmsub231ss	%xmm7, %xmm4, %xmm1
; ││└
; ││ @ In:13 within pdforwardsolve'
; ││┌ @ fastmath.jl:169 within mul_fast' @ fastmath.jl:163
vmulss	%xmm8, %xmm5, %xmm7
; ││└
; ││ @ In:15 within pdforwardsolve'
; ││┌ @ fastmath.jl:161 within add_fast'
vmulss	%xmm2, %xmm7, %xmm1
; ││└
; ││┌ @ fastmath.jl:162 within sub_fast'
vsubss	%xmm6, %xmm8, %xmm0
vmulss	%xmm0, %xmm5, %xmm0
; │└└
; │┌ @ broadcast.jl:800 within materialize!'
; ││┌ @ broadcast.jl:841 within copyto!' @ broadcast.jl:885
; │││┌ @ broadcast.jl:868 within preprocess'
; ││││┌ @ broadcast.jl:872 within preprocess_args'
; │││││┌ @ broadcast.jl:869 within preprocess'
vmovss	%xmm3, (%rsp)
vmovss	%xmm0, 4(%rsp)
vmovss	%xmm1, 8(%rsp)
; │││└└└
; │││ @ broadcast.jl:841 within copyto!' @ array.jl:767
movq	%r13, %rdi
movq	%rsp, %rsi
movq	%r14, %rdx
callq	*%r15
; │└└
; │┌ @ range.jl:595 within iterate'
; ││┌ @ promotion.jl:403 within =='
addq	$-1, %r12 ; │└└ jne L112 jmp L479 ; │ @ In:27 within juliatest_1!' L298: addq$36, %rbx
shlq	$2, %rbp movabsq$140262029000352, %rax  # imm = 0x7F914C69FEA0
vmovss	(%rax), %xmm8           # xmm8 = mem,zero,zero,zero
; │ @ In:28 within juliatest_1!'
; │┌ @ array.jl:730 within getindex'
L320:
vmovss	-32(%rbx), %xmm1        # xmm1 = mem,zero,zero,zero
vmovss	-20(%rbx), %xmm2        # xmm2 = mem,zero,zero,zero
; │└
; │┌ @ In:3 within pdforwardsolve'
; ││┌ @ fastmath.jl:271 within sqrt_fast'
vsqrtss	%xmm2, %xmm2, %xmm2
; │└└
; │┌ @ array.jl:730 within getindex'
vmovss	(%rbx), %xmm3           # xmm3 = mem,zero,zero,zero
; │└
; │┌ @ In:3 within pdforwardsolve'
; ││┌ @ fastmath.jl:254 within div_fast' @ fastmath.jl:164
vdivss	%xmm2, %xmm8, %xmm2
; │└└
; │┌ @ fastmath.jl:163 within pdforwardsolve'
vmulss	-36(%rbx), %xmm2, %xmm4
; │└
; │┌ @ In:5 within pdforwardsolve'
; ││┌ @ fastmath.jl:163 within mul_fast'
vmulss	-16(%rbx), %xmm2, %xmm5
; ││└
; ││ @ In:6 within pdforwardsolve'
; ││┌ @ fastmath.jl:163 within mul_fast'
vmulss	-8(%rbx), %xmm2, %xmm2
vmovss	-12(%rbx), %xmm6        # xmm6 = mem,zero,zero,zero
; ││└
; ││ @ In:7 within pdforwardsolve'
; ││┌ @ fastmath.jl:162 within sub_fast'
; │└└
; │┌ @ fastmath.jl:271 within pdforwardsolve'
vsqrtss	%xmm6, %xmm6, %xmm6
; │└
; │┌ @ In:7 within pdforwardsolve'
; ││┌ @ fastmath.jl:254 within div_fast' @ fastmath.jl:164
vdivss	%xmm6, %xmm8, %xmm6
; │└└
; │┌ @ fastmath.jl:163 within pdforwardsolve'
vmulss	%xmm4, %xmm5, %xmm7
; │└
; │┌ @ In:8 within pdforwardsolve'
; ││┌ @ fastmath.jl:162 within sub_fast'
; ││└
; ││┌ @ fastmath.jl:163 within mul_fast'
vmulss	%xmm6, %xmm5, %xmm5
vmulss	%xmm2, %xmm2, %xmm0
vsubss	%xmm0, %xmm3, %xmm0
; │└└
; │┌ @ fastmath.jl:271 within pdforwardsolve'
vsqrtss	%xmm0, %xmm0, %xmm0
; │└
; │┌ @ In:9 within pdforwardsolve'
; ││┌ @ fastmath.jl:254 within div_fast' @ fastmath.jl:164
vdivss	%xmm0, %xmm8, %xmm0
; │└└
; │┌ @ fastmath.jl:163 within pdforwardsolve'
vmulss	%xmm6, %xmm7, %xmm3
; │└
; │┌ @ In:12 within pdforwardsolve'
; ││┌ @ fastmath.jl:163 within mul_fast'
vmulss	%xmm4, %xmm2, %xmm2
; │└└
; │┌ @ fastmath.jl:162 within pdforwardsolve'
vfmsub231ss	%xmm3, %xmm5, %xmm2
; │└
; │┌ @ In:13 within pdforwardsolve'
; ││┌ @ fastmath.jl:169 within mul_fast' @ fastmath.jl:163
vmulss	%xmm1, %xmm6, %xmm3
; │└└
; │┌ @ fastmath.jl:161 within pdforwardsolve'
vmulss	%xmm0, %xmm3, %xmm0
; │└
; │┌ @ In:15 within pdforwardsolve'
; ││┌ @ fastmath.jl:162 within sub_fast'
vsubss	%xmm7, %xmm1, %xmm1
vmulss	%xmm1, %xmm6, %xmm1
; │└└
; │┌ @ broadcast.jl:800 within materialize!'
; ││┌ @ broadcast.jl:841 within copyto!' @ broadcast.jl:885
; │││┌ @ broadcast.jl:868 within preprocess'
; ││││┌ @ broadcast.jl:872 within preprocess_args'
; │││││┌ @ broadcast.jl:869 within preprocess'
vmovss	%xmm4, (%rsp)
vmovss	%xmm1, 4(%rsp)
vmovss	%xmm0, 8(%rsp)
; │└└└└└
; │┌ @ range.jl:595 within iterate'
; ││┌ @ promotion.jl:403 within =='
addq	$-1, %r12 jne L320 ; │└└ L479: movabsq$140262417817608, %rax  # imm = 0x7F916396E008
addq	$24, %rsp popq %rbx popq %r12 popq %r13 popq %r14 popq %r15 popq %rbp retq nopl (%rax,%rax) ; └  All we see are xmm registers accompanied with instructions that have an “s” in the second-last position, meaning “single” element, not “packed” vector. So the code is not vectorized at all. For good measure, let’s add an @simd ivdep. The ivdep makes a few promises to the compiler that each iteration of the for loop is independent$-$that one iteration will not dependent on the result from a previous one. In : function juliatest_2!(Y::AbstractMatrix, BPP::AbstractMatrix) @inbounds @simd ivdep for i ∈ 1:size(BPP,2) Y[:,i] .= pdforwardsolve( BPP[1,i],BPP[2,i],BPP[3,i], BPP[5,i],BPP[6,i],BPP[7,i],BPP[8,i],BPP[9,i],BPP[10,i] ) end end @benchmark juliatest_2!($Y, $BPP)  Out: BenchmarkTools.Trial: memory estimate: 0 bytes allocs estimate: 0 -------------- minimum time: 19.464 μs (0.00% GC) median time: 19.560 μs (0.00% GC) mean time: 19.603 μs (0.00% GC) maximum time: 35.195 μs (0.00% GC) -------------- samples: 10000 evals/sample: 1 No luck. In this case, we can look at pdforwardsolve, and see that while there is a lot of math, many of those calculations within the function depend on earlier answers. If we need one answer before getting to calculate the next, we can’t calculate them at the same time! That more or less means that calculating one instance of pdforwardsolve cannot be vectorized. But, wait! What about instead just calculating a bunch of pdforwardsolve at in parallel, at the same time (on a single core)? So, let’s try swapping the indices, and using the transposed versions of$\textbf{Z}$and$\textbf{BPP}$: In : function juliatest_3!(Y::AbstractMatrix, BPP::AbstractMatrix) @inbounds @simd ivdep for i ∈ 1:size(BPP,1) Y[i,:] .= pdforwardsolve( BPP[i,1],BPP[i,2],BPP[i,3], BPP[i,5],BPP[i,6],BPP[i,7],BPP[i,8],BPP[i,9],BPP[i,10] ) end end @benchmark juliatest_3!($Y32, $BPP32)  Out: BenchmarkTools.Trial: memory estimate: 0 bytes allocs estimate: 0 -------------- minimum time: 20.631 μs (0.00% GC) median time: 21.123 μs (0.00% GC) mean time: 21.047 μs (0.00% GC) maximum time: 45.611 μs (0.00% GC) -------------- samples: 10000 evals/sample: 1 Worse. This part of optimization can be tricky: you know what you want the compiler to do (vectorize), we know that it is possible, (why wouldn’t it be?), but the compiler isn’t cooperating. The problem is that Julia gets confused. I think what is happening is that, with ivdep, we promised that each of the arrays are distinct, and don’t overlap. That is, that Z and BPP are separate. But, we have a whole lot of referenecs to the same BPP, and we’re writing into a slice of Z. So the compiler isn’t convinced that these don’t overlap or conflict in some way. So let’s try and be a little more explicit. Instead of pointing to a bunch of different indices within the same two arrays, we’re taking a view of each of the columns of these arrays. A view, like the name implies, is just a look at the parent. Meaning editing Y1 edits the first column of Y, because Y1 is the first column of Y. But splitting everything appart in this way makes it more obvious that all these columns are in fact, separate. In : function juliatest_4!(Y::AbstractMatrix{T}, Data::AbstractMatrix{T}) where T @inbounds begin Y1, Y2, Y3 = @views Y[:,1], Y[:,2], Y[:,3] X1, X2, X3 = @views Data[:,1], Data[:,2], Data[:,3] S_11, S_12, S_22 = @views Data[:,5], Data[:,6], Data[:,7] S_13, S_23, S_33 = @views Data[:,8], Data[:,9], Data[:,10] end @inbounds @simd ivdep for i ∈ 1:size(Data,1) (y1, y2, y3) = pdforwardsolve( X1[i], X2[i], X3[i], S_11[i], S_12[i], S_22[i], S_13[i], S_23[i], S_33[i] ) Y1[i] = y1 Y2[i] = y2 Y3[i] = y3 end end @benchmark juliatest_4!($Y32, $BPP32)  Out: BenchmarkTools.Trial: memory estimate: 0 bytes allocs estimate: 0 -------------- minimum time: 2.865 μs (0.00% GC) median time: 2.897 μs (0.00% GC) mean time: 2.898 μs (0.00% GC) maximum time: 4.657 μs (0.00% GC) -------------- samples: 10000 evals/sample: 9 So, now the compiler is suddenly confident enough to vectorize the code, and it’s 6 times faster! Nice. Before seeing if we can make this even faster, I wanted to comment on something else necessary: To vectorize, you have to do the same thing to all elements. That means every “branch” where code could take more than one possible path, like in an “if, then” statement, causes a problem. It is possible to get around this, but it causes problems, and will often stop vectorization. That might not seem like a big deal$-$our code didn’t have any obvious branches, right? Well… In : sqrt(2.3)  Out: 1.51657508881031 In : sqrt(-2.3)  DomainError with -2.3: sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)). Stacktrace:  throw_complex_domainerror(::Symbol, ::Float64) at ./math.jl:31  sqrt(::Float64) at ./math.jl:492  top-level scope at In:1 The code just did one of two different things. It either returned a number, or through an error. Or: In : Y32[1000,1]  Out: 0.28272918f0 In : Y32[10000,1]  BoundsError: attempt to access 2827×3 Array{Float32,2} at index [10000, 1] Stacktrace:  getindex(::Array{Float32,2}, ::Int64, ::Int64) at ./array.jl:730  top-level scope at In:1 Again, one of two different things. These are branches, either of which can kill vectorization. But the square root is especially insidious. For the bounds check, compilers are often smart enough to realize that if the very first and very last bounds check are okay (or maybe only the last), then all of them must be okay. If they notice that, they’ll only make the single check, and everything is okay. Alternatively, @inbounds tells the compiler not to check the boundaries of the array, and just trust everything is within bounds. The sqrt is harder, because how is the compiler supposed to know anything about the numbers that you’re sending in to that function? There isn’t a simple precheck it can use to figure out if things are going to be okay. A solution there is @fastmath, which tells the compiler (among other things) not to worry about checking square roots: In : @fastmath sqrt(-2.3)  Out: NaN No branch, instead it returns the double precision number NaN In : typeof(NaN)  Out: Float64 otherwise, acting exactly as it normally does. While we want both @inbounds and @fastmath while optimizing code, we’d prefer neither while debugging. Errors make it easier to track down where a problem came from. sqrts silently introducing infectious NaNs can be unpleasant to track down. Even worse, Julia is likely to outright crash with a segmentation fault from @inbounds, when the memory really was not inbounds. Note that @inbounds only works inside of functions, so you can’t just try @inbounds [1,2] in the REPL. These are tradeoffs and considerations you have to consider in every language. For example, in C++, array boundaries are not checked by default$-$you’d just get the crash! Finally, the @inline in front of the definition of pdforwardsolve tells the compiler that when you write another function calling pdforwardsolve, instead of actually calling a compiled pdforwardsolve, you want to insert the code. The compiler will often decide to do that automatically, even without @inline, but it is important to make sure that is happening when you vectorize code. This is because, for the for loop to get vectorized, the compiler has to know what it is vectorizing, and generate that vectorized code. If all it has to work with is an opaque “call some other function named pdforwardsolve“, the compiler wouldn’t know what to do with that. I had written this code in a few different languages, including C++, Fortran, Rust, and ispc. Anyway, C++ code doing the same thing: #include <math.h> inline void fpdforwardsolve(float* __restrict Z, float* __restrict BPP, int n, int N){ float x1 = BPP[n ]; float x2 = BPP[n + N]; float x3 = BPP[n + 2*N]; float S11 = BPP[n + 4*N]; float S12 = BPP[n + 5*N]; float S22 = BPP[n + 6*N]; float S13 = BPP[n + 7*N]; float S23 = BPP[n + 8*N]; float S33 = BPP[n + 9*N]; float R11 = 1 / sqrtf(S11); float R11x1 = R11 * x1; float L21 = R11 * S12; float L31 = R11 * S13; float R22 = 1 / sqrtf(S22 - L21*L21); float L32 = R22 * (S23 - L21*L31); float R33 = 1 / sqrtf(S33 - L31*L31 - L32*L32); float nR21x1 = R22 * L21*R11x1; float R31x1 = R33 * ( L32*nR21x1 - L31*R11x1 ); float nR32 = R33 * L32 * R22; Z[n ] = R11x1; Z[n + N] = R22*x2 - nR21x1; Z[n + 2*N] = R31x1 - nR32*x2 + R33*x3; } extern "C" { void processBPP(float* __restrict Z, float* __restrict BPP, int N){ for (int n = 0; n < N; n++){ fpdforwardsolve(Z, BPP, n, N); } } }  If you want to treat it as C code: 1. Get rid of the extern "C" { } surrounding the function processBPP. That line tells a C++ compiler to externally represent that function as though it were C. C automatically externally represents its functions as though they were C. 2. Delete the double underscores in front of the __restricts, replacing them with a simply restrict. I compiled this code with the following two compiler commands, to test two different compilers: g++ -Ofast -march=native -shared -fPIC -mprefer-vector-width=512 -o libgppvectorization_test.so vectorization_test.cpp and clang++ -Ofast -march=native -shared -fPIC -mprefer-vector-width=512 -o libclangvectorization_test.so vectorization_test.cpp Run down of the commands: 1. g++/clang++: the name of the compilers. g++ is from the GNU compiler collection, while clang is an “LLVM”-based C++ compiler. The LLVM project is a collection of tools meant to be modular. To get an idea of the aims and goals of LLVM, the first option under LLVM Tutorial is a tutorial for how to create your own language using LLVM. Surprise, surprise: Julia is also built on LLVM. As are languages like Swift, Rust, and ispc. 2. -Ofast: this turns on a lot of optimizations, as well as @fastmath, turning off errors from math functions, and treats floating point math as though it were associative. 3. -march=native: optimize the code to run fast on the computer compiling the code. And it is okay if the code cannot run on other computer. If you plan on letting people download and run your compiled code, you can’t use this. 4. -shared -fPIC we want a shared library (requires -fPIC). Languages like Julia and R can call compiled shared libraries (.dll on Windows, .dylib on Macs, and .so on Linux). 5. -mprefer-vector-width=512 gcc in particular doesn’t like using 512-bit vectors unless asked. LLVM is better about this. Don’t use this if your computer doesn’t support 512-bit vectors. 6. -o libclangvectorization_test.so name of the output shared library we are creating. Change the suffix based on your operating system. 7. vectorization_test.cpp the name of the file I saved the C++ code as. To call the C++ code from Julia, all we need is what follows. We test to make sure we get the same answers as from Julia before we benchmark, because getting the right answer is more important than speed. In : const cpptest_dir = "/home/chriselrod/Documents/progwork/julia/probcol/processBPP_tests" const clanglib = joinpath(cpptest_dir, "libclangvectorization_test.so") const gpplib = joinpath(cpptest_dir, "libgppvectorization_test.so") function gpptest(Y, BPP, N) ccall((:processBPP, gpplib), Cvoid, (Ptr{Float32}, Ptr{Float32}, Int32), Y, BPP, N[]) end function clangtest(Y, BPP, N) ccall((:processBPP, clanglib), Cvoid, (Ptr{Float32}, Ptr{Float32}, Int32), Y, BPP, N[]) end N32 = Int32(N) fill!(Y, 0); juliatest_1!(Y, BPP); Y  Out: 3×2827 Array{Float32,2}: 1.85509 0.678078 0.870311 … -0.300907 -0.434875 -0.532785 -0.553233 -0.0312362 0.154491 -0.300106 -1.71175 0.23169 1.24083 0.415602 -0.63614 1.56528 -1.70304 -1.38821  In : fill!(Y32, 0); juliatest_4!(Y32, BPP32); Y32'  Out: 3×2827 Adjoint{Float32,Array{Float32,2}}: 1.85509 0.678078 0.870311 … -0.300907 -0.434875 -0.532785 -0.553233 -0.0312362 0.154491 -0.300106 -1.71175 0.23169 1.24083 0.415602 -0.63614 1.56528 -1.70304 -1.38821  In : fill!(Y32, 0); gpptest(Y32, BPP32, N32); Y32'  Out: 3×2827 Adjoint{Float32,Array{Float32,2}}: 1.85509 0.678078 0.870311 … -0.300907 -0.434875 -0.532785 -0.553233 -0.0312363 0.154491 -0.300106 -1.71175 0.23169 1.24083 0.415602 -0.63614 1.56528 -1.70304 -1.38821  In : fill!(Y32, 0); clangtest(Y32, BPP32, N32); Y32'  Out: 3×2827 Adjoint{Float32,Array{Float32,2}}: 1.85509 0.678078 0.870311 … -0.300907 -0.434875 -0.532785 -0.553233 -0.0312362 0.154491 -0.300106 -1.71175 0.23169 1.24083 0.415602 -0.63614 1.56528 -1.70304 -1.38821  In : @benchmark gpptest($Y32, $BPP32,$N32)

Out:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     3.521 μs (0.00% GC)
median time:      3.529 μs (0.00% GC)
mean time:        3.551 μs (0.00% GC)
maximum time:     6.469 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     8
In :
@benchmark clangtest($Y32,$BPP32, $N32)  Out: BenchmarkTools.Trial: memory estimate: 0 bytes allocs estimate: 0 -------------- minimum time: 1.565 μs (0.00% GC) median time: 1.616 μs (0.00% GC) mean time: 1.604 μs (0.00% GC) maximum time: 4.663 μs (0.00% GC) -------------- samples: 10000 evals/sample: 10 Wow! Clang smoked Julia and g++. What is going on? Looking at the assembly reveals a trick both g++ and Clang are using, but Julia is not. Clang is using LLVM version 7, and Julia is LLVM version 6. Once Julia upgrades, we should see it closer Clang. The trick is in our square roots, eg:  float R11 = 1 / sqrtf(S11); Julia is calculating the vector of R11 values by taking the square root of a vector of S11 values, and then dividing a vector of 1s by this vector of square roots. However, the the CPU I am running this code on has an innacurate instruction called vrsqrt14ps. This instruction calculates an approximation to the recirprical of a square root, and is extremely fast. Unfortunately, it is just an approximation, with relative error of$10^{-14}$. So instead of using this value directly, it performs a Newton-Raphson iteration to cut down the error: rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) Now, I said both g++ and Clang use this trick. Then why was g++ slower? Because of a bug, where for my CPU the compiler doesn’t realize it can use this rsqrt approximation. Instead, it does the same thing Clang does to approximate the reciprocal square root…and then multiply that by the original number, to approximate the square root, and then it inverts that approximation to the square root using some other convoluted scheme. Yikes. If we instead compile with: g++ -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512 -o libgppvectorization_test2.so vectorization_test.cpp we don’t give it permission to approximate that square root calculation (while still turning off errors, to let the code be vectorized): In : const gpplib2 = joinpath(cpptest_dir, "libgppvectorization_test2.so") function gpptest2(Y, BPP, N) ccall((:processBPP, gpplib2), Cvoid, (Ptr{Float32}, Ptr{Float32}, Int32), Y, BPP, N[]) end fill!(Y32, 0); gpptest2(Y32, BPP32, N32); Y32'  Out: 3×2827 Adjoint{Float32,Array{Float32,2}}: 1.85509 0.678078 0.870311 … -0.300907 -0.434875 -0.532785 -0.553233 -0.0312363 0.154491 -0.300106 -1.71175 0.23169 1.24083 0.415602 -0.63614 1.56528 -1.70304 -1.38821  In : @benchmark gpptest2($Y32, $BPP32,$N32)

Out:
BenchmarkTools.Trial:
memory estimate:  0 bytes
allocs estimate:  0
--------------
minimum time:     2.868 μs (0.00% GC)
median time:      2.882 μs (0.00% GC)
mean time:        2.889 μs (0.00% GC)
maximum time:     4.729 μs (0.00% GC)
--------------
samples:          10000
evals/sample:     9

So now it is the same fast as Julia, because it’s doing the same thing.

But hmm. We just realized that Clang (thanks to LLVM 7 $-$ LLVM 6 didn’t do that!) is faster than Julia. What can we do?
One choice is to wait until Julia upgrades to LLVM 7. Nah.
Another is to try and compile Julia from source to use LLVM 7. I don’t know enough about Julia’s internals to upgrade it myself. Julia needs a bunch of patches, so it requires a whole lot more work than just telling it to download a different version while building.

So, what does that leave us with?
Well, for starters, we can write our own rsqrt function. In my library SIMDPirates, I have:

@inline function rsqrt_fast(x::NTuple{16,Core.VecElement{Float32}})
Base.llvmcall(
""" %rs = call <16 x float> asm "vrsqrt14ps \$1, \$0", "=x,x"(<16 x float> %0)
ret <16 x float> %rs""",
NTuple{16,Core.VecElement{Float32}}, Tuple{NTuple{16,Core.VecElement{Float32}}}, x)
end
@inline function rsqrt(x::NTuple{16,Core.VecElement{Float32}})
r = rsqrt_fast(x)
# Performs a Newton step to increase accuracy.
ns = vfma(vmul(r,r), x, -3.0f0)
vmul(vmul(-0.5f0, r), ns)
end
@inline rsqrt_fast(x::AbstractStructVec) = SVec(rsqrt_fast(extract_data(x)))
@inline rsqrt(x::AbstractStructVec) = SVec(rsqrt(extract_data(x)))
@inline rsqrt_fast(x) = @fastmath inv(sqrt(x))
@inline rsqrt(x) = @fastmath inv(sqrt(x))


rsqrt_fast calls assembly (asm) instruction vrsqrt14ps, and then rsqrt calls this and adds the Newton Raphson iteratin. Nice!

But, then what about actually vectorizing a for loop? This function would have to be called with an actual vector of 16 32-bit floats to use that instruction. (And what about other functions where it is hard to convince the compiler to vectorize them?)

Well, I wrote a library LoopVectorization that forcably vectorizes some kinds of for loops. Trying it out here:

In :
using SIMDPirates, LoopVectorization

@inline function pdforwardsolve_nofm(x1,x2,x3,S11,S12,S22,S13,S23,S33)
R11 = rsqrt(S11)
R11x1 = R11 * x1
L21 = R11 * S12
L31 = R11 * S13
R22 = rsqrt(S22 - L21*L21)
L32 = R22 * (S23 - L21 * L31)
R33 = rsqrt(S33 - L31*L31 - L32*L32)

nR21x1 = R22 * L21 * R11x1
R31x1 = R33 * ( L32*nR21x1 - L31*R11x1 )
nR32 = R33 * L32 * R22

(
R11x1,
R22*x2 - nR21x1,
R31x1 - nR32*x2 + R33*x3
)
end

@generated function juliatest!(X::AbstractMatrix{T}, BPP::AbstractMatrix{T}) where T
quote
@vectorize $T for i ∈ 1:size(BPP,1) X[i,:] .= pdforwardsolve_nofm( BPP[i,1],BPP[i,2],BPP[i,3], BPP[i,5],BPP[i,6],BPP[i,7],BPP[i,8],BPP[i,9],BPP[i,10] ) end end end fill!(Y32, 0); juliatest!(Y32, BPP32); Y32'  Out: 3×2827 Adjoint{Float32,Array{Float32,2}}: 1.85509 0.678078 0.870311 … -0.300907 -0.434875 -0.532785 -0.553233 -0.0312363 0.154491 -0.300106 -1.71175 0.23169 1.24083 0.415602 -0.63614 1.56528 -1.70304 -1.38821  In : @benchmark juliatest!($Y32, $BPP32)  Out: BenchmarkTools.Trial: memory estimate: 0 bytes allocs estimate: 0 -------------- minimum time: 1.764 μs (0.00% GC) median time: 1.773 μs (0.00% GC) mean time: 1.790 μs (0.00% GC) maximum time: 4.412 μs (0.00% GC) -------------- samples: 10000 evals/sample: 10 Not quite as good as Clang, but close! We can use the macro @macroexpand to expand the @vectorize macro and see what it does. We’ll also use striplines from the MacroTools library, because this stripes out a bunch of debugging information that clutters up the expansion. In : using MacroTools: striplines (@macroexpand @vectorize Float32 for i ∈ 1:size(BPP,1) X[i,:] .= pdforwardsolve( BPP[i,1],BPP[i,2],BPP[i,3], BPP[i,5],BPP[i,6],BPP[i,7],BPP[i,8],BPP[i,9],BPP[i,10] ) end) |> striplines  Out: quote ##N#568 = size(BPP, 1) (##Q#585, ##rem#586) = (LoopVectorization.VectorizationBase).size_loads(BPP, 1, Val{16}()) ##pX#569 = vectorizable(X) ##pBPP#574 = vectorizable(BPP) begin ##stride#573 = LoopVectorization.stride_row(X) * sizeof(eltype(##pX#569)) ##stride#575 = LoopVectorization.stride_row(BPP) * sizeof(eltype(##pBPP#574)) end begin for ##i#567 = 0:64:##Q#585 * 64 - 4 begin ####pBPP#574_i#576 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 0##stride#575) ####pBPP#574_i#577 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 1##stride#575) ####pBPP#574_i#578 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 2##stride#575) ####pBPP#574_i#579 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 4##stride#575) ####pBPP#574_i#580 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 5##stride#575) ####pBPP#574_i#581 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 6##stride#575) ####pBPP#574_i#582 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 7##stride#575) ####pBPP#574_i#583 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 8##stride#575) ####pBPP#574_i#584 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 9##stride#575) begin ##B#570 = LoopVectorization.extract_data.(pdforwardsolve(####pBPP#574_i#576, ####pBPP#574_i#577, ####pBPP#574_i#578, ####pBPP#574_i#579, ####pBPP#574_i#580, ####pBPP#574_i#581, ####pBPP#574_i#582, ####pBPP#574_i#583, ####pBPP#574_i#584)) for ##j#572 = 0:SIMDPirates.vsub(length(##B#570), 1) begin$(Expr(:inbounds, true))
$(Expr(:inbounds, :pop)) #473#val end end end end end end begin if ##rem#586 > 0 ##mask#587 = one(UInt16) << ##rem#586 - one(UInt16) ##i#567 = (##N#568 - ##rem#586) * 4 begin ####pBPP#574_i#576 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 0##stride#575, ##mask#587) ####pBPP#574_i#577 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 1##stride#575, ##mask#587) ####pBPP#574_i#578 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 2##stride#575, ##mask#587) ####pBPP#574_i#579 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 4##stride#575, ##mask#587) ####pBPP#574_i#580 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 5##stride#575, ##mask#587) ####pBPP#574_i#581 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 6##stride#575, ##mask#587) ####pBPP#574_i#582 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 7##stride#575, ##mask#587) ####pBPP#574_i#583 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 8##stride#575, ##mask#587) ####pBPP#574_i#584 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 9##stride#575, ##mask#587) begin ##B#570 = LoopVectorization.extract_data.(pdforwardsolve(####pBPP#574_i#576, ####pBPP#574_i#577, ####pBPP#574_i#578, ####pBPP#574_i#579, ####pBPP#574_i#580, ####pBPP#574_i#581, ####pBPP#574_i#582, ####pBPP#574_i#583, ####pBPP#574_i#584)) for ##j#572 = 0:SIMDPirates.vsub(length(##B#570), 1) begin$(Expr(:inbounds, true))
\$(Expr(:inbounds, :pop))
#474#val
end
end
end
end
end
end
end

Odds are everything isn’t suddenly clear at a glance.
But what is happening is that it takes our for loop, and unrolls it, so that it can handle many iterations at a time with vectors. It also takes care of loading and storing these vectors from vectorizable forms of our input arrays.

Finally, because not all numbers are divisible by [length of the vectors], it handles the remaining iterations with a masked-operations. Meaning, if the for loop covered 106 iterations and the vectors were of length 16:

1. The for loop would repeat 6 times.
2. The last 10 elements would be loaded with a masked load, operations would be performed, and then the answers written with a masked store. The “masks” hide the excess elements, so that we don’t read or write from memory Julia does not have access to (which could trigger a segmentation fault).

Additionally, in forcing vectorization, we don’t need the @fastmath flag. Actually, that flag currently makes the code slower. I’ll fix that sometime.

This macro is a final hammer that can force vectorization.
Other instances where it works is when loops contain functions like log, exp, or sin, which (unlike sqrt`) aren’t natively supported on most CPUs.