GibbsVectorizationIntro








Before we try to hand-optimize the Gibbs model, let’s talk a little about how to actually optimize code, or at least hit a few basics.

This is meant as a short introduction to writing fast code.

For code to run fast, there are a few basic principles:

  1. More efficient algorithm. This is where the biggest wins are possible!
  2. Don’t compute anything you don’t have to. Don’t compute anything more than once (unless it’s faster/cheaper to re-compute than to store and look up).
  3. Parallelism. CPU clock speeds have largely stopped increasing in recent years. Instead, recent increases in CPU throughput have come in the form of parallelism. Parallelism takes many forms, the most obvious of which is multiple separate CPU cores. Another major source of speedup is SIMD (Single Instruction Multiple Data), where a single instruction from one CPU core can act on multiple pieces of data simultaneously. Most recent CPUs have 256 bit registers (where the data is held while the CPU operates on it). A standard double precision number is 64 bits, meaning these CPUs can potentially operate on 4 numbers at a time. A few CPUs have already upgraded to 512 bit registers, potentially acting on 8 double precision numbers or 16 single precision numbers per operation.
  4. Memory is slow! Consider this benchmark:
In [1]:
using BenchmarkTools, SIMDArrays
A = @Static randn(8,13)
B = @Static randn(13,10)
@benchmark foreach(_ -> $A * $B, 1:1000)
Out[1]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.761 μs (0.00% GC)
  median time:      19.007 μs (0.00% GC)
  mean time:        19.051 μs (0.00% GC)
  maximum time:     55.397 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [2]:
# 299_792_458 m/s (speed of light) * 19 ns * 1 s / 1e9 ns = meters light travels in 19 ns
299_792_458  * 19 * 1 / 1e9
Out[2]:
5.696056702

I repeated the matrix multiplication benchmark 1000 times to get more accurate timings.

Light travels 299,792,458 meters per second. That means that light can only travel about 5.7 meters in the time it takes to multiply an $8 \times 13$ and a $13 \times 10$ matrix.

That computation requires taking $8 * 10 = 80$ dot products with vectors of length 13. Each of these dot products requires 13 multiplications and 12 additions. That means that a single CPU core did 1040 multiplications and 960 additions in the time it took light to travel 5.7 meters, less than the width of a typical office.

Each core of a computer can churn through data blisteringly fast.
The CPU’s speed is battling the laws of physics. But memory is much slower: figuring out where the data we want actually lives, and moving it to the CPU, takes far longer than operating on data that is already there. RAM can be far away.

As an example of where this can be a problem, consider a simple dot product.

In [3]:
using LinearAlgebra
BLAS.set_num_threads(1)
x = rand(128); y = rand(128);
@btime foreach(_ -> $x'* $y, 1:10^6)
  14.562 ms (0 allocations: 0 bytes)
In [4]:
x = rand(128 * 10^6); y = rand(128 * 10^6);
@btime $x'* $y
  154.016 ms (0 allocations: 0 bytes)
Out[4]:
3.200391627717342e7
In [5]:
156.32 / 14.843
Out[5]:
10.531563700060634

Multiplying the length of a vector by a million makes the dot product more than 10 million times slower!
More or less, that means for every microsecond the CPU spent working, it spent at least 9 microseconds just waiting for data to show up!

I won’t focus on this much in this post, but it’s something to be mindful of while trying to write fast code.
Try not to jump around in memory much. CPUs have different levels of cache memory. These caches are much smaller than RAM, but they’re much faster. Typically, L3 is large and shared among cores, while each core gets its own L2 and L1 cache. The L1 cache is the smallest, but also the closest to the action and the fastest.
On my computer, each core has 32 KiB of L1 data cache, meaning it can store roughly 32 KiB * 1024 B/KiB / (8 B per 64-bit float) = 4096 64-bit floats. Breaking bigger problems down and processing them in chunks small enough to fit in the L1 cache, e.g. using fewer than 4096 doubles at a time, will be faster.
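That arithmetic, as a quick check (a throwaway snippet, not from the original notebook):

# How many Float64 values fit in a 32 KiB L1 data cache.
32 * 1024 ÷ sizeof(Float64)   # 32768 bytes ÷ 8 bytes per Float64 = 4096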

Many optimized libraries and algorithms do this. For example, optimized matrix multiplication routines multiply big matrices block-wise, where the blocks are small enough to fit in the L1 cache.
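As a rough sketch of the blocking idea (not an optimized kernel; real BLAS implementations also pack and reorder data, and the function name and block size here are purely illustrative), assuming square matrices whose size is a multiple of the block size:

# Multiply in bs×bs blocks, so each block’s working set stays cache-resident.
function blocked_matmul!(C, A, B; bs = 64)
    n = size(A, 1)
    fill!(C, 0)
    @inbounds for jb ∈ 1:bs:n, kb ∈ 1:bs:n, ib ∈ 1:bs:n
        for j ∈ jb:jb+bs-1, k ∈ kb:kb+bs-1
            Bkj = B[k, j]
            for i ∈ ib:ib+bs-1
                C[i, j] += A[i, k] * Bkj   # innermost loop walks one column of A and C
            end
        end
    end
    C
end

The point is simply that each block of A, B, and C gets reused many times while it is still in cache, instead of streaming the full matrices through memory over and over.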

My focus here will be on the SIMD (Single Instruction Multiple Data) instructions. This is also often known as vectorization, because each instruction operates on a “vector” of data. This can be confusing for those coming from R, MATLAB, or Python, where “vectorization” often means “not using a for loop”.
Vectorizing in languages with great compilers, such as Julia, C++, or Fortran, means having the compiler generate SIMD instructions.
Many “vectorized” functions from the former sets of languages will also often really be vectorized in the sense of the latter, because they’re likely to be calling compiled C/C++/Fortran code.

Within the compiled languages, “for loops” often are vectorized. For example,

In [6]:
function my_dot_product(a, b)
    d = 0.0
    @inbounds @simd for i ∈ eachindex(a, b)
        d += a[i] * b[i]
    end
    d
end
x = rand(128); y = rand(128)
@btime $x' * $y
  16.519 ns (0 allocations: 0 bytes)
Out[6]:
34.05424279632278
In [7]:
@btime my_dot_product($x, $y)
  7.090 ns (0 allocations: 0 bytes)
Out[7]:
34.054242796322775

We get the same answer as the “builtin”, bar rounding error, but our function is actually faster! That’s because the base dot product calls an external BLAS library instead of native Julia code. The advantage of the external library is that it is multithreaded (so long as we don’t call BLAS.set_num_threads(1), as we did above), which speeds it up for really big vectors. Memory throughput should be higher when there are more cores.
For vectors too small to benefit, it’s faster not to call an external library.

We can look at the assembly instructions of my_dot_product with @code_native.
When looking for vectorization, the things to look for are vector registers. xmm registers are 128 bits, ymm registers are 256 bits, and zmm registers are 512 bits. An instruction will often look like this:

vfmadd231pd (%rdx,%rdi,8), %zmm4, %zmm0

The instruction is named vfmadd231pd. The fmadd part says what it is doing: fmadd means fused multiply-add, meaning it is doing both an addition and a multiplication. add does addition, sub subtraction, mul multiplication, div division, etc. mov means to move memory from one place to another.

The second-to-last letter, p, means “packed”, meaning that it is operating on a vector “packed” with numbers. It could alternatively say s, which means it is only operating on a single number, even if the register could hold an entire vector.
The d means “double precision”, meaning the operation is multiplying and adding double precision (64 bit) floating point numbers. Remember that the zmm register holds 512 bits $-$ that’s just 512 zeros/ones. The last letter of the instruction specifies how exactly those zeros and ones are being interpreted. q means integer, and s means single precision.

This instruction has 3 inputs (but many have only 1 or 2): it is multiplying numbers from a region in memory (%rdx,%rdi,8) with the register zmm4, and adding the result to zmm0.
In AT&T syntax (which this is), the answer gets stored in the rightmost of the registers that follow, meaning zmm0 gets overwritten with the sum of itself and the product of (%rdx,%rdi,8) and %zmm4. That is:

%zmm0 = (%rdx,%rdi,8) * %zmm4 + %zmm0

Because zmm registers hold 8 doubles, this single instruction does 8 multiplications and 8 additions.
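In Julia terms, one such instruction is roughly the following (a sketch using plain tuples to stand in for the 8 lanes of a zmm register; the variable names are purely illustrative):

a = ntuple(_ -> rand(), 8)   # 8 doubles loaded from memory: (%rdx,%rdi,8)
b = ntuple(_ -> rand(), 8)   # the contents of %zmm4
c = ntuple(_ -> 0.0, 8)      # the running sums held in %zmm0
c = muladd.(a, b, c)         # one vfmadd231pd: 8 multiplies and 8 adds at once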

Without further ado:

In [8]:
@code_native my_dot_product(x, y)
	.text
; ┌ @ In[6]:3 within `my_dot_product'
; │┌ @ simdloop.jl:65 within `macro expansion'
; ││┌ @ abstractarray.jl:209 within `eachindex' @ abstractarray.jl:219 @ abstractarray.jl:216
; │││┌ @ abstractarray.jl:95 within `axes1'
; ││││┌ @ abstractarray.jl:75 within `axes'
; │││││┌ @ In[6]:2 within `size'
	subq	$40, %rsp
	movq	24(%rdi), %rcx
; │││││└
; │││││┌ @ tuple.jl:125 within `map'
; ││││││┌ @ range.jl:318 within `Type' @ range.jl:309
; │││││││┌ @ promotion.jl:414 within `max'
	movq	%rcx, %rax
	sarq	$63, %rax
	andnq	%rcx, %rax, %rax
; │││└└└└└
; │││ @ abstractarray.jl:209 within `eachindex' @ array.jl:155
	movq	24(%rsi), %r8
; │││ @ abstractarray.jl:209 within `eachindex' @ abstractarray.jl:220
; │││┌ @ abstractarray.jl:226 within `_all_match_first'
; ││││┌ @ abstractarray.jl:220 within `#91'
; │││││┌ @ abstractarray.jl:216 within `eachindex'
; ││││││┌ @ abstractarray.jl:95 within `axes1'
; │││││││┌ @ abstractarray.jl:75 within `axes'
; ││││││││┌ @ tuple.jl:125 within `map'
; │││││││││┌ @ range.jl:318 within `Type' @ range.jl:309
; ││││││││││┌ @ promotion.jl:414 within `max'
	movq	%r8, %rdx
	sarq	$63, %rdx
	andnq	%r8, %rdx, %rdx
; ││││└└└└└└└
; ││││┌ @ range.jl:718 within `==' @ promotion.jl:403
	cmpq	%rdx, %rax
; │││└└
	jne	L300
; │││ @ abstractarray.jl:209 within `eachindex' @ abstractarray.jl:219 @ abstractarray.jl:216
; │││┌ @ abstractarray.jl:95 within `axes1'
; ││││┌ @ abstractarray.jl:75 within `axes'
; │││││┌ @ tuple.jl:125 within `map'
; ││││││┌ @ range.jl:318 within `Type' @ range.jl:309
; │││││││┌ @ promotion.jl:414 within `max'
	testq	%rcx, %rcx
; ││└└└└└└
; ││ @ simdloop.jl:68 within `macro expansion'
	jle	L73
	movq	(%rdi), %rcx
	movq	(%rsi), %rdx
; ││ @ simdloop.jl:71 within `macro expansion'
	cmpq	$32, %rax
	jae	L85
; └└
; ┌ @ simdloop.jl within `my_dot_product'
	vxorpd	%xmm0, %xmm0, %xmm0
	xorl	%esi, %esi
	jmp	L272
L73:
	vxorps	%xmm0, %xmm0, %xmm0
; └
; ┌ @ In[6]:6 within `my_dot_product'
	addq	$40, %rsp
	vzeroupper
	retq
; │ @ In[6]:3 within `my_dot_product'
; │┌ @ simdloop.jl:71 within `macro expansion'
L85:
	movabsq	$9223372036854775776, %rsi # imm = 0x7FFFFFFFFFFFFFE0
	andq	%rax, %rsi
	vxorpd	%xmm0, %xmm0, %xmm0
	xorl	%edi, %edi
	vxorpd	%xmm1, %xmm1, %xmm1
	vxorpd	%xmm2, %xmm2, %xmm2
	vxorpd	%xmm3, %xmm3, %xmm3
	nopw	%cs:(%rax,%rax)
; ││ @ simdloop.jl:73 within `macro expansion' @ In[6]:4
; ││┌ @ array.jl:729 within `getindex'
L128:
	vmovupd	(%rcx,%rdi,8), %zmm4
	vmovupd	64(%rcx,%rdi,8), %zmm5
	vmovupd	128(%rcx,%rdi,8), %zmm6
	vmovupd	192(%rcx,%rdi,8), %zmm7
; ││└
; ││┌ @ float.jl:395 within `+'
	vfmadd231pd	(%rdx,%rdi,8), %zmm4, %zmm0
	vfmadd231pd	64(%rdx,%rdi,8), %zmm5, %zmm1
	vfmadd231pd	128(%rdx,%rdi,8), %zmm6, %zmm2
	vfmadd231pd	192(%rdx,%rdi,8), %zmm7, %zmm3
; │└└
; │┌ @ int.jl:53 within `macro expansion'
	addq	$32, %rdi
	cmpq	%rdi, %rsi
	jne	L128
; │└
; │┌ @ simdloop.jl:73 within `macro expansion' @ In[6]:4
; ││┌ @ float.jl:395 within `+'
	vaddpd	%zmm0, %zmm1, %zmm0
	vaddpd	%zmm0, %zmm2, %zmm0
	vaddpd	%zmm0, %zmm3, %zmm0
	vextractf64x4	$1, %zmm0, %ymm1
	vaddpd	%zmm1, %zmm0, %zmm0
	vextractf128	$1, %ymm0, %xmm1
	vaddpd	%zmm1, %zmm0, %zmm0
	vpermilpd	$1, %xmm0, %xmm1 # xmm1 = xmm0[1,0]
	vaddpd	%zmm1, %zmm0, %zmm0
	cmpq	%rsi, %rax
; └└└
; ┌ @ simdloop.jl:71 within `my_dot_product'
	je	L292
	nopw	%cs:(%rax,%rax)
; └
; ┌ @ In[6]:3 within `my_dot_product'
; │┌ @ simdloop.jl:73 within `macro expansion' @ In[6]:4
; ││┌ @ array.jl:729 within `getindex'
L272:
	vmovsd	(%rcx,%rsi,8), %xmm1    # xmm1 = mem[0],zero
; ││└
; ││┌ @ float.jl:395 within `+'
	vfmadd231sd	(%rdx,%rsi,8), %xmm1, %xmm0
; ││└
; ││ @ simdloop.jl:74 within `macro expansion'
; ││┌ @ int.jl:53 within `+'
	addq	$1, %rsi
; ││└
; ││ @ simdloop.jl:71 within `macro expansion'
; ││┌ @ int.jl:49 within `<'
	cmpq	%rax, %rsi
; ││└
	jb	L272
; │└
; │ @ In[6]:6 within `my_dot_product'
L292:
	addq	$40, %rsp
	vzeroupper
	retq
; │ @ In[6]:3 within `my_dot_product'
; │┌ @ simdloop.jl:65 within `macro expansion'
; ││┌ @ abstractarray.jl:209 within `eachindex' @ abstractarray.jl:220
L300:
	movabsq	$jl_system_image_data, %rax
	movq	%rax, 8(%rsp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, 16(%rsp)
	movq	%rdi, 24(%rsp)
	movq	%rsi, 32(%rsp)
	movabsq	$jl_invoke, %rax
	movabsq	$140261687004432, %rdi  # imm = 0x7F9138078D10
	leaq	8(%rsp), %rsi
	movl	$4, %edx
	callq	*%rax
	ud2
	nopw	%cs:(%rax,%rax)
; └└└

That is a lot of output, but here is the most important part:

L128:
    vmovupd (%rcx,%rdi,8), %zmm4
    vmovupd 64(%rcx,%rdi,8), %zmm5
    vmovupd 128(%rcx,%rdi,8), %zmm6
    vmovupd 192(%rcx,%rdi,8), %zmm7
; ││└
; ││┌ @ float.jl:395 within `+'
    vfmadd231pd (%rdx,%rdi,8), %zmm4, %zmm0
    vfmadd231pd 64(%rdx,%rdi,8), %zmm5, %zmm1
    vfmadd231pd 128(%rdx,%rdi,8), %zmm6, %zmm2
    vfmadd231pd 192(%rdx,%rdi,8), %zmm7, %zmm3
; │└└
; │┌ @ int.jl:53 within `macro expansion'
    addq    $32, %rdi
    cmpq    %rdi, %rsi
    jne L128

The vmovupd instruction loads data from memory into a register. Once these are loaded, it uses the vfmadd instructions to multiply them with more numbers from a register, and update cumulative sums.
It does each of these things four times per loop iteration. Since each zmm register holds 8 doubles, every pass through the loop does 32 multiplications and 32 additions.

Then it increments the loop counter (%rdi) with addq, uses cmpq (compare integers) to check whether the loop is finished, and finally conditionally jumps (jne) back to L128 (the label marked 128) or continues on in the code. L128 is the start of the loop, just before the memory loads $-$ jumping back there is what makes the loop loop.

So hold on. We wrote a for loop where, on each iteration, we said to get 1 element from each array, multiply the two of them, and add the product to the cumulative sum. But this assembly code is instead saying: do that to 32 numbers per loop iteration!
That’s not what our code said!

Tricks like this are a part of why languages like C++ are fast: the compiler sees what you wrote, but then tells the computer to do something that will get the same answer, but is much faster.

Actually, that is not quite true here. Floating point arithmetic is not associative. That means that $(a + b) + c$ does not necessarily equal $a + (b + c)$, due to rounding errors.

For example:

In [9]:
1e18 + 65.0 - 1e18 - 65.0
Out[9]:
63.0
In [10]:
foo(a, b) = a + b - a - b
foo(1e18, 65)
Out[10]:
63.0
In [11]:
foo(1e18, 63)
Out[11]:
-63.0

You’re likely to get the same result in any language. Why?

Because the machine epsilon $-$ the spacing between adjacent representable numbers $-$ at a number as massive as 1e18 is 128.

In [12]:
eps(1e18)
Out[12]:
128.0

This means that the distance from 1e18 to the closest numbers 64 bit floats can represent above and below it is 128.

In [13]:
nextfloat(1e18) - 1e18
Out[13]:
128.0
In [14]:
1e18 - prevfloat(1e18)
Out[14]:
128.0

Therefore, 1e18 + 65 rounds to 1e18 + 128. Then when we subtract 1e18, we get 128. Subtract 65, and we have 63.

Similarly, 1e18 + 63 rounds to 1e18. Subtract 1e18, and we have 0. Subtract 63, and we have -63.
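We can check that rounding directly (a quick sanity check, not in the original notebook):

1e18 + 65 == 1e18 + 128   # true: 65 is past the midpoint (64), so it rounds up
1e18 + 63 == 1e18         # true: 63 is below the midpoint, so it rounds down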

Because floating point arithmetic is not associative, compilers will not, by default, apply associativity-based transforms to reorder floating point math. When in doubt, the compiler assumes you meant exactly what you wrote, in the order you wrote it.

It would also be confusing if turning on compiler optimizations by itself suddenly changed the answer you got! Rounding errors are not always trivial. $\pm 63$ is not trivial!

For an example of an algorithm written with this (lack of) associativity in mind, see Kahan summation. Kahan summation is a numerically accurate summation algorithm: while summing up your vector, it also calculates and tracks the numerical error of the running sum, and repeatedly adds this error back onto the sum to correct it.
If a compiler were allowed to assume that floating point math is associative, it would realize that the tracked error is (given associativity!) exactly zero, and therefore delete all of your error-handling code. Then, suddenly, your accurate summation algorithm is a regular summation algorithm.
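As a sketch (my own illustration, not code from this post), Kahan summation looks something like this:

# Compensated summation: c tracks the rounding error of each addition
# and feeds it back in on the next iteration.
function kahan_sum(x)
    s = zero(eltype(x))
    c = zero(eltype(x))
    for xi ∈ x
        y = xi - c
        t = s + y          # low-order bits of y may be lost here...
        c = (t - s) - y    # ...but (t - s) - y recovers them into c
        s = t
    end
    s
end

If the additions were treated as associative, (t - s) - y would simplify to exactly zero, and the whole compensation would be optimized away.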

In our dot product, however, the compiler did assume associativity, and reordered our computations.
It didn’t calculate one product, add it to a single cumulative sum, and repeat. It calculated 32 products at a time, accumulating 32 separate sums, and only at the end did it finally add these 32 sums together (plus a second loop to cover any stragglers, because most vectors’ lengths aren’t a multiple of 32).
That method is actually more accurate than simply doing it in order with a single sum. But it isn’t what we wrote, and the compiler can’t know on its own that we were not intending to do something fancy like Kahan summation.
So we have to give the compiler permission to optimize, telling it that isn’t what we were doing. That is what the @simd in front of our for loop does.
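Conceptually, the transformed loop looks something like this scalar sketch (4 accumulators standing in for the 32 SIMD lanes, and assuming the length is a multiple of 4):

# Reassociated dot product: four independent running sums, combined only at the end.
function dot4(a, b)
    s1 = s2 = s3 = s4 = 0.0
    @inbounds for i ∈ 1:4:length(a)
        s1 += a[i]   * b[i]
        s2 += a[i+1] * b[i+1]
        s3 += a[i+2] * b[i+2]
        s4 += a[i+3] * b[i+3]
    end
    (s1 + s2) + (s3 + s4)   # only here do the partial sums meet
end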

Another macro that does this and applies to more than for loops is @fastmath:

In [15]:
fast_foo(a, b) = @fastmath a + b - a - b
fast_foo(1e18, 65.0)
Out[15]:
0.0
In [16]:
@code_native foo(1e18, 65.0)
	.text
; ┌ @ In[10]:1 within `foo'
; │┌ @ In[10]:1 within `+'
	vaddsd	%xmm1, %xmm0, %xmm2
; │└
; │┌ @ float.jl:397 within `-'
	vsubsd	%xmm0, %xmm2, %xmm0
	vsubsd	%xmm1, %xmm0, %xmm0
; │└
	retq
	nopl	(%rax)
; └
In [17]:
@code_native fast_foo(1e18, 65.0)
	.text
; ┌ @ In[15]:1 within `fast_foo'
	vxorps	%xmm0, %xmm0, %xmm0
	retq
	nopw	%cs:(%rax,%rax)
; └

The vxorps instruction does the xor (“exclusive or”) operation: the result is true only where exactly one of the inputs is true, so xor-ing anything with itself always equals zero. So while foo takes us literally and does exactly what we asked for, fast_foo did the algebra, realized everything cancels out, and concluded all it has to do is spit out “0”.

Also, integer arithmetic, unlike floating point arithmetic, is associative. So no promises needed:

In [18]:
@code_native foo(10^18, 65)
	.text
; ┌ @ In[10]:1 within `foo'
	xorl	%eax, %eax
	retq
	nopw	%cs:(%rax,%rax)
; └

Our regular, non-fastmath foo, when compiling a version for integers, also just xors to spit out a zero.

Another reason these compiled languages are faster is point 2 from our list: not doing anything you don’t have to, and doing the things you do need only once $-$ e.g., sorting out all the administrative details of what the input data is, and what happens to it, ahead of time.
Figuring out what to do on the fly can take many times longer than just doing it, if “it” is fast.
In interpreted languages, like R and Python, each variable is a reference to data. Then when you want to do something with those variables, the interpreter has to “interpret” the meanings of everything as it happens, figuring out what the variables are, what it’s supposed to do with them, etc. We can emulate this from Julia like this:

In [19]:
x = Ref(1.0); y = Ref(1.0)
@btime x[] + y[]
  34.876 ns (3 allocations: 48 bytes)
Out[19]:
2.0

The most important factor in the above benchmark was leaving off the $ I used earlier each time I benchmarked. The $ lets the benchmark know the types of the variables I am benchmarking. This lets it emulate calling the benchmarked expression from within a function. Inside a function, Julia should (normally) be aware of the types of all the variables, which lets it sort out the sequence of computations it actually has to do ahead of time.
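For comparison (a sketch; exact timings will vary by machine), the interpolated version of the same expression:

@btime $x[] + $y[]   # with $, the Ref types are known up front, so this compiles
                     # down to two loads and an add, with no allocations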

Now, 30-something nanoseconds isn’t bad on its own.
If R spends 100 ns interpreting things when you call rstan, who cares?

If all you’re doing is calling big functions or “vectorized” routines written in other languages, performance is fine. Actually, if that is all you’re doing, the interpreter is likely to spend a lot less time figuring out which already-compiled C code it should be running than Julia will compiling code before it gets to run it.
Try comparing how long it takes a fresh R and Julia session to load a plotting package and spit out the first plot.

The problem becomes apparent when you want to write fast routines yourself, building them out of lots of little instructions.
Imagine if, in our matrix multiplication, our 1040 multiplications and 960 additions each took 30+ ns. We certainly couldn’t have done it all in less than 20 ns!

Or, see what happens to a for loop:

In [20]:
x = rand(128); y = rand(128)
@btime (d = 0; for i ∈ eachindex(x, y); d += x[i] * y[i]; end; d)
  9.736 μs (640 allocations: 11.98 KiB)
Out[20]:
31.624253451226192
In [21]:
@btime my_dot_product($x, $y)
  9.379 ns (0 allocations: 0 bytes)
Out[21]:
31.624253451226178

Scaling this up, you can imagine: if some Julia code takes 1 hour to run, it could take a month to run interpreted.
In cases like that, who cares about compiling for a second?

This can be important if you’d like to run lengthy simulations without being constrained to using tools that already exist. How often is the best tool for the job something that’s already there and available?
Of course, other tools like Rcpp exist to do this from within R, and popular packages like JAGS/NIMBLE and rstan will translate a DSL (domain-specific language) to C++ and compile it.

There are a few tricks to keep in mind when trying to get code to vectorize.

So let’s walk through an example I had fun with:

In [22]:
@inline function pdforwardsolve(x1,x2,x3,S11,S12,S22,S13,S23,S33)
    @fastmath begin
        R11 = 1 / sqrt(S11)
        R11x1 = R11 * x1
        L21 = R11 * S12
        L31 = R11 * S13
        R22 = 1 / sqrt(S22 - L21*L21)
        L32 = R22 * (S23 - L21 * L31)
        R33 = 1 / sqrt(S33 - L31*L31 - L32*L32)

        nR21x1 = R22 * L21 * R11x1
        R31x1 = R33 * ( L32*nR21x1 - L31*R11x1 )
        nR32 = R33 * L32 * R22

        (
            R11x1,
            R22*x2 - nR21x1,
            R31x1 - nR32*x2 + R33*x3
        )
    end
end
Out[22]:
pdforwardsolve (generic function with 1 method)

This function calculates:

\begin{align*}
\textbf{L} &= \text{chol}(\textbf{S})\\
\textbf{y} &= \textbf{L}^{-1}\textbf{x},
\end{align*}

where $\textbf{S}$ is a $3\times 3$ symmetric and positive definite matrix.

Our data consists of a large number of vectors $\textbf{x}$ and matrices $\textbf{S}$, and we want to solve for $\textbf{y}$ for each of them. So we want a function to calculate this for every element.

The most important trick to remember for vectorizing code is that CPUs can efficiently load or operate on contiguous data, and “under the hood”, all data is stored linearly as a vector. That means in a matrix like:

In [23]:
reshape(collect(1:12), (4,3))
Out[23]:
4×3 Array{Int64,2}:
 1  5   9
 2  6  10
 3  7  11
 4  8  12

The 4 and 5 are next to each other in memory, while the 1 and 5, despite appearing next to each other in the matrix, are 4 apart. This is called column-major storage, because data within a column are next to one another.

Row-major storage is the opposite: data within a row are next to each other.
So, while trying to get a function to vectorize, we need to ask:

  1. What pattern would let us do multiple computations simultaneously?
  2. How then will we have to organize our data to let that happen?

Now, the standard advice you see online is: because pdforwardsolve operates on one vector and one matrix at a time, we should store our data so that the data within each of these is close together. For example:

Accessing the memory in the order in which it is stored will be faster. In Fortran, this means writing loops like

do j = 1, n
    do i = 1, n
        ! operate on A(i,j) ...
    end do
end do

On the other hand, in C we want:

for (i = 0; i < n; i++) {
   for (j = 0; j < n; j++) {
       /* operate on A[i][j] ... */
   }
}
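Julia, like Fortran and MATLAB, is column-major, so the analogous cache-friendly pattern keeps the row index in the innermost loop. A small sketch (the function name is just for illustration):

# Traverse a matrix in memory order: column by column.
function colmajor_sum(A)
    s = zero(eltype(A))
    @inbounds for j ∈ axes(A, 2)   # columns in the outer loop
        for i ∈ axes(A, 1)         # rows in the inner loop: contiguous memory
            s += A[i, j]
        end
    end
    s
end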

Or if you want something official looking, from MATLAB’s documentation:

Code segment 2 is faster because it traverses the elements of the 2-D array by going down the columns in the inner loop. If the algorithm permits, you can also maximize cache efficiency by using the single index method, x(k), instead of x(r,c).

So, let’s try that. First, some code to generate a fake data set.

In [24]:
using StaticArrays, Random, BenchmarkTools
N = 2827
Y32 = Matrix{Float32}(undef, N, 3);
BPP32 = Matrix{Float32}(undef, N, 10);
@inline function randpd(::Type{T} = Float64, ::Val{N} = Val(3), ::Val{R} = Val(N)) where {T,N,R}
    (@SMatrix randn(T, 2N, N)) |> x -> x' * x;
end

@inline function randpdv(::Type{T}) where T
    sS = randpd(T, Val(3))
    @inbounds SVector(sS[1], sS[4], sS[5], sS[7], sS[8], sS[9])
end

function fillpd!(X::AbstractMatrix{T}) where T
    @inbounds for i ∈ 1:size(X,1)
        X[i,:] = randpdv(T)
    end
end
randn!(@view BPP32[:,1:4]);
fillpd!(@view BPP32[:,5:10])

BPP = copy(BPP32')
Y = Matrix{Float32}(undef, 3, N)


function juliatest_1!(Y::AbstractMatrix, BPP::AbstractMatrix)
    @inbounds for i ∈ 1:size(BPP,2)
        Y[:,i] .= pdforwardsolve(
            BPP[1,i],BPP[2,i],BPP[3,i],
            BPP[5,i],BPP[6,i],BPP[7,i],BPP[8,i],BPP[9,i],BPP[10,i]
        )
    end
end

@benchmark juliatest_1!($Y, $BPP)
Out[24]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     18.924 μs (0.00% GC)
  median time:      19.005 μs (0.00% GC)
  mean time:        19.039 μs (0.00% GC)
  maximum time:     34.947 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Is 19 microseconds good? Hard to know without a baseline, but let’s take a quick glance at the assembly.

In [25]:
@code_native juliatest_1!(Y, BPP)
	.text
; ┌ @ In[24]:27 within `juliatest_1!'
	pushq	%rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	subq	$24, %rsp
	movq	%rsi, 16(%rsp)
	movq	8(%rsi), %rax
; │┌ @ array.jl:154 within `size'
	movq	32(%rax), %r12
; │└
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:275 within `Type'
; │││┌ @ range.jl:280 within `unitrange_last'
; ││││┌ @ operators.jl:333 within `>='
; │││││┌ @ int.jl:428 within `<='
	testq	%r12, %r12
; │└└└└└
	jle	L479
	movq	(%rsi), %rcx
	movq	24(%rcx), %r14
	movq	(%rax), %rbx
	movq	24(%rax), %rbp
	testq	%r14, %r14
; │ @ In[24]:27 within `juliatest_1!'
	jle	L298
	movq	(%rcx), %r13
; │ @ In[24]:27 within `juliatest_1!'
	shlq	$2, %r14
	addq	$36, %rbx
	shlq	$2, %rbp
	movabsq	$140262029000352, %rax  # imm = 0x7F914C69FEA0
	vmovss	(%rax), %xmm0           # xmm0 = mem[0],zero,zero,zero
	vmovss	%xmm0, 12(%rsp)
	movabsq	$__memcpy_avx_unaligned_erms, %r15
	nopl	(%rax,%rax)
; │ @ In[24]:28 within `juliatest_1!'
; │┌ @ array.jl:730 within `getindex'
L112:
	vmovss	-32(%rbx), %xmm8        # xmm8 = mem[0],zero,zero,zero
	vmovss	-20(%rbx), %xmm1        # xmm1 = mem[0],zero,zero,zero
; │└
; │┌ @ In[22]:3 within `pdforwardsolve'
; ││┌ @ fastmath.jl:271 within `sqrt_fast'
	vsqrtss	%xmm1, %xmm1, %xmm1
; │└└
; │┌ @ array.jl:730 within `getindex'
	vmovss	(%rbx), %xmm2           # xmm2 = mem[0],zero,zero,zero
	vmovss	12(%rsp), %xmm0         # xmm0 = mem[0],zero,zero,zero
; │└
; │┌ @ In[22]:3 within `pdforwardsolve'
; ││┌ @ fastmath.jl:254 within `div_fast' @ fastmath.jl:164
	vdivss	%xmm1, %xmm0, %xmm1
; ││└
; ││ @ In[22]:4 within `pdforwardsolve'
; ││┌ @ fastmath.jl:163 within `mul_fast'
	vmulss	-36(%rbx), %xmm1, %xmm3
; │└└
; │┌ @ fastmath.jl:163 within `pdforwardsolve'
	vmulss	-16(%rbx), %xmm1, %xmm4
; │└
; │┌ @ In[22]:6 within `pdforwardsolve'
; ││┌ @ fastmath.jl:163 within `mul_fast'
	vmulss	-8(%rbx), %xmm1, %xmm1
	vmovss	-12(%rbx), %xmm5        # xmm5 = mem[0],zero,zero,zero
; │└└
; │┌ @ fastmath.jl:162 within `pdforwardsolve'
	vfnmadd231ss	%xmm4, %xmm4, %xmm5
; │└
; │┌ @ In[22]:7 within `pdforwardsolve'
; ││┌ @ fastmath.jl:271 within `sqrt_fast'
	vsqrtss	%xmm5, %xmm5, %xmm5
; │└└
; │┌ @ fastmath.jl:164 within `pdforwardsolve'
	vdivss	%xmm5, %xmm0, %xmm5
; │└
; │┌ @ In[22]:11 within `pdforwardsolve'
; ││┌ @ fastmath.jl:169 within `mul_fast' @ fastmath.jl:163
	vmulss	%xmm3, %xmm4, %xmm6
; │└└
; │┌ @ fastmath.jl:162 within `pdforwardsolve'
	vfnmadd213ss	-4(%rbx), %xmm1, %xmm4
; │└
; │┌ @ In[22]:8 within `pdforwardsolve'
; ││┌ @ fastmath.jl:163 within `mul_fast'
	vmulss	%xmm5, %xmm4, %xmm4
	vmulss	%xmm1, %xmm1, %xmm7
	vfmadd231ss	%xmm4, %xmm4, %xmm7
	vsubss	%xmm7, %xmm2, %xmm2
; │└└
; │┌ @ fastmath.jl:271 within `pdforwardsolve'
	vsqrtss	%xmm2, %xmm2, %xmm2
; │└
; │┌ @ In[22]:9 within `pdforwardsolve'
; ││┌ @ fastmath.jl:254 within `div_fast' @ fastmath.jl:164
	vdivss	%xmm2, %xmm0, %xmm2
; │└└
; │┌ @ fastmath.jl:163 within `pdforwardsolve'
	vmulss	%xmm5, %xmm6, %xmm7
; │└
; │┌ @ In[22]:12 within `pdforwardsolve'
; ││┌ @ fastmath.jl:163 within `mul_fast'
	vmulss	%xmm3, %xmm1, %xmm1
; ││└
; ││┌ @ fastmath.jl:162 within `sub_fast'
	vfmsub231ss	%xmm7, %xmm4, %xmm1
; ││└
; ││ @ In[22]:13 within `pdforwardsolve'
; ││┌ @ fastmath.jl:169 within `mul_fast' @ fastmath.jl:163
	vmulss	%xmm8, %xmm5, %xmm7
	vaddss	-28(%rbx), %xmm1, %xmm1
	vfnmadd213ss	%xmm1, %xmm4, %xmm7
; ││└
; ││ @ In[22]:15 within `pdforwardsolve'
; ││┌ @ fastmath.jl:161 within `add_fast'
	vmulss	%xmm2, %xmm7, %xmm1
; ││└
; ││┌ @ fastmath.jl:162 within `sub_fast'
	vsubss	%xmm6, %xmm8, %xmm0
	vmulss	%xmm0, %xmm5, %xmm0
; │└└
; │┌ @ broadcast.jl:800 within `materialize!'
; ││┌ @ broadcast.jl:841 within `copyto!' @ broadcast.jl:885
; │││┌ @ broadcast.jl:868 within `preprocess'
; ││││┌ @ broadcast.jl:872 within `preprocess_args'
; │││││┌ @ broadcast.jl:869 within `preprocess'
	vmovss	%xmm3, (%rsp)
	vmovss	%xmm0, 4(%rsp)
	vmovss	%xmm1, 8(%rsp)
; │││└└└
; │││ @ broadcast.jl:841 within `copyto!' @ array.jl:767
	movq	%r13, %rdi
	movq	%rsp, %rsi
	movq	%r14, %rdx
	callq	*%r15
; │└└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
	addq	%r14, %r13
	addq	%rbp, %rbx
	addq	$-1, %r12
; │└└
	jne	L112
	jmp	L479
; │ @ In[24]:27 within `juliatest_1!'
L298:
	addq	$36, %rbx
	shlq	$2, %rbp
	movabsq	$140262029000352, %rax  # imm = 0x7F914C69FEA0
	vmovss	(%rax), %xmm8           # xmm8 = mem[0],zero,zero,zero
; │ @ In[24]:28 within `juliatest_1!'
; │┌ @ array.jl:730 within `getindex'
L320:
	vmovss	-32(%rbx), %xmm1        # xmm1 = mem[0],zero,zero,zero
	vmovss	-20(%rbx), %xmm2        # xmm2 = mem[0],zero,zero,zero
; │└
; │┌ @ In[22]:3 within `pdforwardsolve'
; ││┌ @ fastmath.jl:271 within `sqrt_fast'
	vsqrtss	%xmm2, %xmm2, %xmm2
; │└└
; │┌ @ array.jl:730 within `getindex'
	vmovss	(%rbx), %xmm3           # xmm3 = mem[0],zero,zero,zero
; │└
; │┌ @ In[22]:3 within `pdforwardsolve'
; ││┌ @ fastmath.jl:254 within `div_fast' @ fastmath.jl:164
	vdivss	%xmm2, %xmm8, %xmm2
; │└└
; │┌ @ fastmath.jl:163 within `pdforwardsolve'
	vmulss	-36(%rbx), %xmm2, %xmm4
; │└
; │┌ @ In[22]:5 within `pdforwardsolve'
; ││┌ @ fastmath.jl:163 within `mul_fast'
	vmulss	-16(%rbx), %xmm2, %xmm5
; ││└
; ││ @ In[22]:6 within `pdforwardsolve'
; ││┌ @ fastmath.jl:163 within `mul_fast'
	vmulss	-8(%rbx), %xmm2, %xmm2
	vmovss	-12(%rbx), %xmm6        # xmm6 = mem[0],zero,zero,zero
; ││└
; ││ @ In[22]:7 within `pdforwardsolve'
; ││┌ @ fastmath.jl:162 within `sub_fast'
	vfnmadd231ss	%xmm5, %xmm5, %xmm6
; │└└
; │┌ @ fastmath.jl:271 within `pdforwardsolve'
	vsqrtss	%xmm6, %xmm6, %xmm6
; │└
; │┌ @ In[22]:7 within `pdforwardsolve'
; ││┌ @ fastmath.jl:254 within `div_fast' @ fastmath.jl:164
	vdivss	%xmm6, %xmm8, %xmm6
; │└└
; │┌ @ fastmath.jl:163 within `pdforwardsolve'
	vmulss	%xmm4, %xmm5, %xmm7
; │└
; │┌ @ In[22]:8 within `pdforwardsolve'
; ││┌ @ fastmath.jl:162 within `sub_fast'
	vfnmadd213ss	-4(%rbx), %xmm2, %xmm5
; ││└
; ││┌ @ fastmath.jl:163 within `mul_fast'
	vmulss	%xmm6, %xmm5, %xmm5
	vmulss	%xmm2, %xmm2, %xmm0
	vfmadd231ss	%xmm5, %xmm5, %xmm0
	vsubss	%xmm0, %xmm3, %xmm0
; │└└
; │┌ @ fastmath.jl:271 within `pdforwardsolve'
	vsqrtss	%xmm0, %xmm0, %xmm0
; │└
; │┌ @ In[22]:9 within `pdforwardsolve'
; ││┌ @ fastmath.jl:254 within `div_fast' @ fastmath.jl:164
	vdivss	%xmm0, %xmm8, %xmm0
; │└└
; │┌ @ fastmath.jl:163 within `pdforwardsolve'
	vmulss	%xmm6, %xmm7, %xmm3
; │└
; │┌ @ In[22]:12 within `pdforwardsolve'
; ││┌ @ fastmath.jl:163 within `mul_fast'
	vmulss	%xmm4, %xmm2, %xmm2
; │└└
; │┌ @ fastmath.jl:162 within `pdforwardsolve'
	vfmsub231ss	%xmm3, %xmm5, %xmm2
; │└
; │┌ @ In[22]:13 within `pdforwardsolve'
; ││┌ @ fastmath.jl:169 within `mul_fast' @ fastmath.jl:163
	vmulss	%xmm1, %xmm6, %xmm3
	vaddss	-28(%rbx), %xmm2, %xmm2
	vfnmadd213ss	%xmm2, %xmm5, %xmm3
; │└└
; │┌ @ fastmath.jl:161 within `pdforwardsolve'
	vmulss	%xmm0, %xmm3, %xmm0
; │└
; │┌ @ In[22]:15 within `pdforwardsolve'
; ││┌ @ fastmath.jl:162 within `sub_fast'
	vsubss	%xmm7, %xmm1, %xmm1
	vmulss	%xmm1, %xmm6, %xmm1
; │└└
; │┌ @ broadcast.jl:800 within `materialize!'
; ││┌ @ broadcast.jl:841 within `copyto!' @ broadcast.jl:885
; │││┌ @ broadcast.jl:868 within `preprocess'
; ││││┌ @ broadcast.jl:872 within `preprocess_args'
; │││││┌ @ broadcast.jl:869 within `preprocess'
	vmovss	%xmm4, (%rsp)
	vmovss	%xmm1, 4(%rsp)
	vmovss	%xmm0, 8(%rsp)
; │└└└└└
; │┌ @ range.jl:595 within `iterate'
; ││┌ @ promotion.jl:403 within `=='
	addq	%rbp, %rbx
	addq	$-1, %r12
	jne	L320
; │└└
L479:
	movabsq	$140262417817608, %rax  # imm = 0x7F916396E008
	addq	$24, %rsp
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
	nopl	(%rax,%rax)
; └

All we see are xmm registers, accompanied by instructions that have an “s” in the second-to-last position, meaning “single” element, not “packed” vector.

So the code is not vectorized at all. For good measure, let’s add an @simd ivdep. The ivdep makes a few extra promises to the compiler, namely that each iteration of the for loop is independent $-$ that one iteration does not depend on the result of a previous one.

In [26]:
function juliatest_2!(Y::AbstractMatrix, BPP::AbstractMatrix)
    @inbounds @simd ivdep for i ∈ 1:size(BPP,2)
        Y[:,i] .= pdforwardsolve(
            BPP[1,i],BPP[2,i],BPP[3,i],
            BPP[5,i],BPP[6,i],BPP[7,i],BPP[8,i],BPP[9,i],BPP[10,i]
        )
    end
end
@benchmark juliatest_2!($Y, $BPP)
Out[26]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     19.464 μs (0.00% GC)
  median time:      19.560 μs (0.00% GC)
  mean time:        19.603 μs (0.00% GC)
  maximum time:     35.195 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

No luck.

In this case, we can look at pdforwardsolve, and see that while there is a lot of math, many of those calculations within the function depend on earlier answers. If we need one answer before getting to calculate the next, we can’t calculate them at the same time!
That more or less means that calculating one instance of pdforwardsolve cannot be vectorized.

But, wait!
What about instead calculating a bunch of pdforwardsolve calls in parallel, at the same time (on a single core)?

So, let’s try swapping the indices, and using the transposed versions of $\textbf{Y}$ and $\textbf{BPP}$ (Y32 and BPP32):

In [27]:
function juliatest_3!(Y::AbstractMatrix, BPP::AbstractMatrix)
    @inbounds @simd ivdep for i ∈ 1:size(BPP,1)
        Y[i,:] .= pdforwardsolve(
            BPP[i,1],BPP[i,2],BPP[i,3],
            BPP[i,5],BPP[i,6],BPP[i,7],BPP[i,8],BPP[i,9],BPP[i,10]
        )
    end
end
@benchmark juliatest_3!($Y32, $BPP32)
Out[27]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.631 μs (0.00% GC)
  median time:      21.123 μs (0.00% GC)
  mean time:        21.047 μs (0.00% GC)
  maximum time:     45.611 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Worse. This part of optimization can be tricky: we know what we want the compiler to do (vectorize), we know that it is possible (why wouldn’t it be?), but the compiler isn’t cooperating.

The problem is that Julia gets confused. I think what is happening is that, with ivdep, we promised that the arrays are distinct and don’t overlap $-$ that is, that Y and BPP are separate.
But we have a whole lot of references to the same BPP, and we’re writing into a slice of Y. So the compiler isn’t convinced that these accesses don’t overlap or conflict in some way.

So let’s try and be a little more explicit.
Instead of pointing to a bunch of different indices within the same two arrays, we take a view of each of the columns of these arrays. A view, like the name implies, is just a look at the parent: editing Y1 edits the first column of Y, because Y1 is the first column of Y.
But splitting everything apart in this way makes it more obvious that all these columns are, in fact, separate.

In [28]:
function juliatest_4!(Y::AbstractMatrix{T}, Data::AbstractMatrix{T}) where T
    @inbounds begin
        Y1, Y2, Y3 = @views Y[:,1], Y[:,2], Y[:,3]
        X1, X2, X3 = @views Data[:,1], Data[:,2], Data[:,3]
        S_11, S_12, S_22 = @views Data[:,5], Data[:,6], Data[:,7]
        S_13, S_23, S_33 = @views Data[:,8], Data[:,9], Data[:,10]
    end
    @inbounds @simd ivdep for i ∈ 1:size(Data,1)
        (y1, y2, y3) = pdforwardsolve(
            X1[i], X2[i], X3[i],
            S_11[i], S_12[i], S_22[i], S_13[i], S_23[i], S_33[i]
        )
        Y1[i] = y1
        Y2[i] = y2
        Y3[i] = y3
    end
end
@benchmark juliatest_4!($Y32, $BPP32)
Out[28]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.865 μs (0.00% GC)
  median time:      2.897 μs (0.00% GC)
  mean time:        2.898 μs (0.00% GC)
  maximum time:     4.657 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

So, now the compiler is suddenly confident enough to vectorize the code, and it’s 6 times faster! Nice.

Before seeing if we can make this even faster, I wanted to comment on something else necessary:
To vectorize, you have to do the same thing to all elements.
That means every “branch”, where code could take more than one possible path (like in an “if, then” statement), is a problem. It is sometimes possible to work around branches, but they will often stop vectorization.

That might not seem like a big deal $-$ our code didn’t have any obvious branches, right?
Well…

In [29]:
sqrt(2.3)
Out[29]:
1.51657508881031
In [30]:
sqrt(-2.3)
DomainError with -2.3:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).

Stacktrace:
 [1] throw_complex_domainerror(::Symbol, ::Float64) at ./math.jl:31
 [2] sqrt(::Float64) at ./math.jl:492
 [3] top-level scope at In[30]:1

The code just did one of two different things.
It either returned a number, or threw an error.

Or:

In [31]:
Y32[1000,1]
Out[31]:
0.28272918f0
In [32]:
Y32[10000,1]
BoundsError: attempt to access 2827×3 Array{Float32,2} at index [10000, 1]

Stacktrace:
 [1] getindex(::Array{Float32,2}, ::Int64, ::Int64) at ./array.jl:730
 [2] top-level scope at In[32]:1

Again, one of two different things. These are branches, either of which can kill vectorization.

But the square root is especially insidious. For the bounds check, compilers are often smart enough to realize that if the very first and very last bounds check are okay (or maybe only the last), then all of them must be okay. If they notice that, they’ll only make the single check, and everything is okay.

Alternatively, @inbounds tells the compiler not to check the boundaries of the array, and just trust everything is within bounds.
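For example (a sketch; the function names are made up), the only difference between these two is whether each access can carry a bounds-check branch:

function checked_sum(x)
    s = zero(eltype(x))
    for i ∈ 1:length(x)
        s += x[i]            # bounds checked, unless the compiler proves i is always valid
    end
    s
end

function unchecked_sum(x)
    s = zero(eltype(x))
    @inbounds for i ∈ 1:length(x)
        s += x[i]            # @inbounds: skip the check unconditionally and trust us
    end
    s
end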

The sqrt is harder, because how is the compiler supposed to know anything about the numbers that you’re sending in to that function? There isn’t a simple precheck it can use to figure out if things are going to be okay.

A solution there is @fastmath, which tells the compiler (among other things) not to worry about checking square roots:

In [33]:
@fastmath sqrt(-2.3)
Out[33]:
NaN

No branch; instead, it returns the double precision number NaN

In [34]:
typeof(NaN)
Out[34]:
Float64

otherwise, acting exactly as it normally does.

While we want both @inbounds and @fastmath while optimizing code, we’d prefer neither while debugging.
Errors make it easier to track down where a problem came from.
sqrts silently introducing infectious NaNs can be unpleasant to track down. Even worse, Julia is likely to outright crash with a segmentation fault from @inbounds when the memory access really was not in bounds. Note that @inbounds only works inside of functions, so you can’t just try @inbounds [1,2][1000] in the REPL.

These are tradeoffs and considerations you have to consider in every language. For example, in C++, array boundaries are not checked by default $-$ you’d just get the crash!

Finally, the @inline in front of the definition of pdforwardsolve tells the compiler that when you write another function calling pdforwardsolve, instead of actually calling a separately compiled pdforwardsolve, you want its code inserted directly at the call site. The compiler will often decide to do that automatically, even without @inline, but it is important to make sure it is happening when you vectorize code.
This is because, for the for loop to get vectorized, the compiler has to know what it is vectorizing, and generate that vectorized code. If all it has to work with is an opaque “call some other function named pdforwardsolve”, the compiler wouldn’t know what to do with that.
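One way to check (a sketch, not from the original notebook) is to look at the LLVM IR rather than the assembly: if the call was inlined and the loop vectorized, vector types such as <16 x float> or <8 x double> show up in the loop body.

@code_llvm juliatest_4!(Y32, BPP32)   # search the output for <16 x float> / <8 x double>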

I had written this code in a few different languages, including C++, Fortran, Rust, and ispc.
Anyway, C++ code doing the same thing:

#include <math.h>

inline void fpdforwardsolve(float* __restrict Z, float* __restrict BPP, int n, int N){

    float x1  = BPP[n      ];
    float x2  = BPP[n +   N];
    float x3  = BPP[n + 2*N];
    float S11 = BPP[n + 4*N];
    float S12 = BPP[n + 5*N];
    float S22 = BPP[n + 6*N];
    float S13 = BPP[n + 7*N];
    float S23 = BPP[n + 8*N];
    float S33 = BPP[n + 9*N];

    float R11 = 1 / sqrtf(S11);
    float R11x1 = R11 * x1;
    float L21 = R11 * S12;
    float L31 = R11 * S13;
    float R22 = 1 / sqrtf(S22 - L21*L21);
    float L32 = R22 * (S23 - L21*L31);
    float R33 = 1 / sqrtf(S33 - L31*L31 - L32*L32);

    float nR21x1 = R22 * L21*R11x1;
    float R31x1 = R33 * ( L32*nR21x1 - L31*R11x1 );
    float nR32 = R33 * L32 * R22;


    Z[n      ] = R11x1;
    Z[n +   N] = R22*x2 - nR21x1;
    Z[n + 2*N] = R31x1 - nR32*x2 + R33*x3;

}

extern "C" {
void processBPP(float* __restrict Z, float* __restrict BPP, int N){
    for (int n = 0; n < N; n++){
        fpdforwardsolve(Z, BPP, n, N);
    }
}
}

If you want to treat it as C code:

  1. Get rid of the extern "C" { } surrounding the function processBPP. That line tells a C++ compiler to externally represent that function as though it were C. C automatically externally represents its functions as though they were C.
  2. Delete the double underscores in front of the __restricts, replacing them with simply restrict.

I compiled this code with the following two compiler commands, to test two different compilers:

g++ -Ofast -march=native -shared -fPIC -mprefer-vector-width=512 -o libgppvectorization_test.so  vectorization_test.cpp

and

clang++ -Ofast -march=native -shared -fPIC -mprefer-vector-width=512 -o libclangvectorization_test.so  vectorization_test.cpp

Run down of the commands:

  1. g++/clang++: the name of the compilers. g++ is from the GNU compiler collection, while clang is an “LLVM”-based C++ compiler. The LLVM project is a collection of tools meant to be modular. To get an idea of the aims and goals of LLVM, the first option under LLVM Tutorial is a tutorial for how to create your own language using LLVM. Surprise, surprise: Julia is also built on LLVM. As are languages like Swift, Rust, and ispc.
  2. -Ofast: this turns on a lot of optimizations, as well as the equivalent of @fastmath: it turns off errors from math functions, and treats floating point math as though it were associative.
  3. -march=native: optimize the code to run fast on the computer compiling the code, accepting that the result may not run on other computers. If you plan on letting people download and run your compiled code, you can’t use this.
  4. -shared -fPIC we want a shared library (requires -fPIC). Languages like Julia and R can call compiled shared libraries (.dll on Windows, .dylib on Macs, and .so on Linux).
  5. -mprefer-vector-width=512 gcc in particular doesn’t like using 512-bit vectors unless asked. LLVM is better about this. Don’t use this if your computer doesn’t support 512-bit vectors.
  6. -o libclangvectorization_test.so name of the output shared library we are creating. Change the suffix based on your operating system.
  7. vectorization_test.cpp the name of the file I saved the C++ code as.

To call the C++ code from Julia, all we need is what follows. We test to make sure we get the same answers as from Julia before we benchmark, because getting the right answer is more important than speed.

In [35]:
const cpptest_dir = "/home/chriselrod/Documents/progwork/julia/probcol/processBPP_tests"
const clanglib = joinpath(cpptest_dir, "libclangvectorization_test.so")
const gpplib = joinpath(cpptest_dir, "libgppvectorization_test.so")
function gpptest(Y, BPP, N)
    ccall((:processBPP, gpplib), Cvoid, (Ptr{Float32}, Ptr{Float32}, Int32), Y, BPP, N[])
end
function clangtest(Y, BPP, N)
    ccall((:processBPP, clanglib), Cvoid, (Ptr{Float32}, Ptr{Float32}, Int32), Y, BPP, N[])
end
N32 = Int32(N)

fill!(Y, 0); juliatest_1!(Y, BPP); Y
Out[35]:
3×2827 Array{Float32,2}:
  1.85509    0.678078    0.870311  …  -0.300907  -0.434875  -0.532785
 -0.553233  -0.0312362   0.154491     -0.300106  -1.71175    0.23169 
  1.24083    0.415602   -0.63614       1.56528   -1.70304   -1.38821 
In [36]:
fill!(Y32, 0); juliatest_4!(Y32, BPP32); Y32'
Out[36]:
3×2827 Adjoint{Float32,Array{Float32,2}}:
  1.85509    0.678078    0.870311  …  -0.300907  -0.434875  -0.532785
 -0.553233  -0.0312362   0.154491     -0.300106  -1.71175    0.23169 
  1.24083    0.415602   -0.63614       1.56528   -1.70304   -1.38821 
In [37]:
fill!(Y32, 0); gpptest(Y32, BPP32, N32); Y32'
Out[37]:
3×2827 Adjoint{Float32,Array{Float32,2}}:
  1.85509    0.678078    0.870311  …  -0.300907  -0.434875  -0.532785
 -0.553233  -0.0312363   0.154491     -0.300106  -1.71175    0.23169 
  1.24083    0.415602   -0.63614       1.56528   -1.70304   -1.38821 
In [38]:
fill!(Y32, 0); clangtest(Y32, BPP32, N32); Y32'
Out[38]:
3×2827 Adjoint{Float32,Array{Float32,2}}:
  1.85509    0.678078    0.870311  …  -0.300907  -0.434875  -0.532785
 -0.553233  -0.0312362   0.154491     -0.300106  -1.71175    0.23169 
  1.24083    0.415602   -0.63614       1.56528   -1.70304   -1.38821 
In [39]:
@benchmark gpptest($Y32, $BPP32, $N32)
Out[39]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.521 μs (0.00% GC)
  median time:      3.529 μs (0.00% GC)
  mean time:        3.551 μs (0.00% GC)
  maximum time:     6.469 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8
In [40]:
@benchmark clangtest($Y32, $BPP32, $N32)
Out[40]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.565 μs (0.00% GC)
  median time:      1.616 μs (0.00% GC)
  mean time:        1.604 μs (0.00% GC)
  maximum time:     4.663 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

Wow! Clang smoked Julia and g++. What is going on?

Looking at the assembly reveals a trick both g++ and Clang are using, but Julia is not.
Clang is using LLVM version 7, while Julia is on LLVM version 6. Once Julia upgrades, we should see it get closer to Clang.

The trick is in our square roots, eg:

    float R11 = 1 / sqrtf(S11);

Julia is calculating the vector of R11 values by taking the square root of a vector of S11 values, and then dividing a vector of 1s by this vector of square roots.
However, the CPU I am running this code on has an approximate instruction called vrsqrt14ps. This instruction calculates an approximation to the reciprocal of a square root, and is extremely fast. Unfortunately, it is just an approximation, with relative error of up to $2^{-14}$. So instead of using this value directly, the compiled code performs a Newton-Raphson iteration to cut down the error:

rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
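To see how much that one Newton-Raphson step buys, here is a scalar sketch in ordinary Julia (not the actual vrsqrt14ps instruction; the starting error is made up):

a  = 2.3
r0 = inv(sqrt(a)) * (1 + 1e-4)          # pretend this came from the approximate instruction
r1 = -0.5 * r0 * (a * r0 * r0 - 3.0)    # one Newton-Raphson step, same form as above
abs(r0 - inv(sqrt(a))) * sqrt(a)        # relative error ≈ 1e-4
abs(r1 - inv(sqrt(a))) * sqrt(a)        # relative error ≈ 1.5e-8: roughly squared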

Now, I said both g++ and Clang use this trick. Then why was g++ slower?
Because of a bug: for my CPU, g++ doesn’t realize it can use this rsqrt approximation directly. Instead, it does the same thing Clang does to approximate the reciprocal square root…and then multiplies that by the original number to approximate the square root, and then inverts that square root approximation using some other convoluted scheme. Yikes.

If we instead compile with:

g++ -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512 -o libgppvectorization_test2.so  vectorization_test.cpp

we don’t give it permission to approximate that square root calculation (while still turning off errors, to let the code be vectorized):

In [41]:
const gpplib2 = joinpath(cpptest_dir, "libgppvectorization_test2.so")
function gpptest2(Y, BPP, N)
    ccall((:processBPP, gpplib2), Cvoid, (Ptr{Float32}, Ptr{Float32}, Int32), Y, BPP, N[])
end
fill!(Y32, 0); gpptest2(Y32, BPP32, N32); Y32'
Out[41]:
3×2827 Adjoint{Float32,Array{Float32,2}}:
  1.85509    0.678078    0.870311  …  -0.300907  -0.434875  -0.532785
 -0.553233  -0.0312363   0.154491     -0.300106  -1.71175    0.23169 
  1.24083    0.415602   -0.63614       1.56528   -1.70304   -1.38821 
In [42]:
@benchmark gpptest2($Y32, $BPP32, $N32)
Out[42]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.868 μs (0.00% GC)
  median time:      2.882 μs (0.00% GC)
  mean time:        2.889 μs (0.00% GC)
  maximum time:     4.729 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

So now it is just as fast as Julia, because it’s doing the same thing.

But hmm. We just realized that Clang (thanks to LLVM 7 $-$ LLVM 6 didn’t do that!) is faster than Julia. What can we do?
One choice is to wait until Julia upgrades to LLVM 7. Nah.
Another is to try to compile Julia from source against LLVM 7. I don’t know enough about Julia’s internals to upgrade it myself; Julia applies a bunch of patches to LLVM, so it requires a whole lot more work than just telling the build to download a different version.

So, what does that leave us with?
Well, for starters, we can write our own rsqrt function. In my library SIMDPirates, I have:

@inline function rsqrt_fast(x::NTuple{16,Core.VecElement{Float32}})
    Base.llvmcall(
    """ %rs = call <16 x float> asm "vrsqrt14ps \$1, \$0", "=x,x"(<16 x float> %0)
        ret <16 x float> %rs""",
    NTuple{16,Core.VecElement{Float32}}, Tuple{NTuple{16,Core.VecElement{Float32}}}, x)
end
@inline function rsqrt(x::NTuple{16,Core.VecElement{Float32}})
    r = rsqrt_fast(x)
    # Performs a Newton step to increase accuracy.
    ns = vfma(vmul(r,r), x, -3.0f0)
    vmul(vmul(-0.5f0, r), ns)
end
@inline rsqrt_fast(x::AbstractStructVec) = SVec(rsqrt_fast(extract_data(x)))
@inline rsqrt(x::AbstractStructVec) = SVec(rsqrt(extract_data(x)))
@inline rsqrt_fast(x) = @fastmath inv(sqrt(x))
@inline rsqrt(x) = @fastmath inv(sqrt(x))

rsqrt_fast calls the assembly (asm) instruction vrsqrt14ps, and then rsqrt calls it and adds the Newton-Raphson iteration. Nice!

But, then what about actually vectorizing a for loop? This function would have to be called with an actual vector of 16 32-bit floats to use that instruction. (And what about other functions where it is hard to convince the compiler to vectorize them?)

Well, I wrote a library LoopVectorization that forcibly vectorizes some kinds of for loops. Trying it out here:

In [43]:
using SIMDPirates, LoopVectorization

@inline function pdforwardsolve_nofm(x1,x2,x3,S11,S12,S22,S13,S23,S33)
    R11 = rsqrt(S11)
    R11x1 = R11 * x1
    L21 = R11 * S12
    L31 = R11 * S13
    R22 = rsqrt(S22 - L21*L21)
    L32 = R22 * (S23 - L21 * L31)
    R33 = rsqrt(S33 - L31*L31 - L32*L32)

    nR21x1 = R22 * L21 * R11x1
    R31x1 = R33 * ( L32*nR21x1 - L31*R11x1 )
    nR32 = R33 * L32 * R22

    (
        R11x1,
        R22*x2 - nR21x1,
        R31x1 - nR32*x2 + R33*x3
    )
end


@generated function juliatest!(X::AbstractMatrix{T}, BPP::AbstractMatrix{T}) where T
    quote
        @vectorize $T for i ∈ 1:size(BPP,1)
            X[i,:] .= pdforwardsolve_nofm(
                BPP[i,1],BPP[i,2],BPP[i,3],
                BPP[i,5],BPP[i,6],BPP[i,7],BPP[i,8],BPP[i,9],BPP[i,10]
            )
        end
    end
end

fill!(Y32, 0); juliatest!(Y32, BPP32); Y32'
Out[43]:
3×2827 Adjoint{Float32,Array{Float32,2}}:
  1.85509    0.678078    0.870311  …  -0.300907  -0.434875  -0.532785
 -0.553233  -0.0312363   0.154491     -0.300106  -1.71175    0.23169 
  1.24083    0.415602   -0.63614       1.56528   -1.70304   -1.38821 
In [44]:
@benchmark juliatest!($Y32, $BPP32)
Out[44]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.764 μs (0.00% GC)
  median time:      1.773 μs (0.00% GC)
  mean time:        1.790 μs (0.00% GC)
  maximum time:     4.412 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

Not quite as good as Clang, but close!
We can use the macro @macroexpand to expand the @vectorize macro and see what it does. We’ll also use striplines from the MacroTools library, because it strips out a bunch of debugging information that clutters up the expansion.

In [45]:
using MacroTools: striplines

(@macroexpand @vectorize Float32 for i ∈ 1:size(BPP,1)
            X[i,:] .= pdforwardsolve(
                BPP[i,1],BPP[i,2],BPP[i,3],
                BPP[i,5],BPP[i,6],BPP[i,7],BPP[i,8],BPP[i,9],BPP[i,10]
            )
        end) |> striplines
Out[45]:
quote
    ##N#568 = size(BPP, 1)
    (##Q#585, ##rem#586) = (LoopVectorization.VectorizationBase).size_loads(BPP, 1, Val{16}())
    ##pX#569 = vectorizable(X)
    ##pBPP#574 = vectorizable(BPP)
    begin
        ##stride#573 = LoopVectorization.stride_row(X) * sizeof(eltype(##pX#569))
        ##stride#575 = LoopVectorization.stride_row(BPP) * sizeof(eltype(##pBPP#574))
    end
    begin
        for ##i#567 = 0:64:##Q#585 * 64 - 4
            begin
                ####pBPP#574_i#576 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 0##stride#575)
                ####pBPP#574_i#577 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 1##stride#575)
                ####pBPP#574_i#578 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 2##stride#575)
                ####pBPP#574_i#579 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 4##stride#575)
                ####pBPP#574_i#580 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 5##stride#575)
                ####pBPP#574_i#581 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 6##stride#575)
                ####pBPP#574_i#582 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 7##stride#575)
                ####pBPP#574_i#583 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 8##stride#575)
                ####pBPP#574_i#584 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 9##stride#575)
                begin
                    ##B#570 = LoopVectorization.extract_data.(pdforwardsolve(####pBPP#574_i#576, ####pBPP#574_i#577, ####pBPP#574_i#578, ####pBPP#574_i#579, ####pBPP#574_i#580, ####pBPP#574_i#581, ####pBPP#574_i#582, ####pBPP#574_i#583, ####pBPP#574_i#584))
                    for ##j#572 = 0:SIMDPirates.vsub(length(##B#570), 1)
                        begin
                            $(Expr(:inbounds, true))
                            local #473#val = SIMDPirates.vstore(getindex(##B#570, SIMDPirates.vadd(1, ##j#572)), SIMDPirates.vadd(##pX#569, ##i#567, SIMDPirates.vmul(##stride#573, ##j#572)))
                            $(Expr(:inbounds, :pop))
                            #473#val
                        end
                    end
                end
            end
        end
    end
    begin
        if ##rem#586 > 0
            ##mask#587 = one(UInt16) << ##rem#586 - one(UInt16)
            ##i#567 = (##N#568 - ##rem#586) * 4
            begin
                ####pBPP#574_i#576 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 0##stride#575, ##mask#587)
                ####pBPP#574_i#577 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 1##stride#575, ##mask#587)
                ####pBPP#574_i#578 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 2##stride#575, ##mask#587)
                ####pBPP#574_i#579 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 4##stride#575, ##mask#587)
                ####pBPP#574_i#580 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 5##stride#575, ##mask#587)
                ####pBPP#574_i#581 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 6##stride#575, ##mask#587)
                ####pBPP#574_i#582 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 7##stride#575, ##mask#587)
                ####pBPP#574_i#583 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 8##stride#575, ##mask#587)
                ####pBPP#574_i#584 = SIMDPirates.vload(SVec{16,Float32}, ##pBPP#574 + ##i#567 + 9##stride#575, ##mask#587)
                begin
                    ##B#570 = LoopVectorization.extract_data.(pdforwardsolve(####pBPP#574_i#576, ####pBPP#574_i#577, ####pBPP#574_i#578, ####pBPP#574_i#579, ####pBPP#574_i#580, ####pBPP#574_i#581, ####pBPP#574_i#582, ####pBPP#574_i#583, ####pBPP#574_i#584))
                    for ##j#572 = 0:SIMDPirates.vsub(length(##B#570), 1)
                        begin
                            $(Expr(:inbounds, true))
                            local #474#val = SIMDPirates.vstore(getindex(##B#570, SIMDPirates.vadd(1, ##j#572)), SIMDPirates.vadd(##pX#569, ##i#567, SIMDPirates.vmul(##stride#573, ##j#572)), ##mask#587)
                            $(Expr(:inbounds, :pop))
                            #474#val
                        end
                    end
                end
            end
        end
    end
end

Odds are everything isn’t suddenly clear at a glance.
But what is happening is that it takes our for loop, and unrolls it, so that it can handle many iterations at a time with vectors. It also takes care of loading and storing these vectors from vectorizable forms of our input arrays.

Finally, because not every loop length is divisible by the vector width, it handles the remaining iterations with masked operations (see the sketch after this list). Meaning, if the for loop covered 106 iterations and the vectors were of length 16:

  1. The for loop would repeat 6 times.
  2. The last 10 elements would be loaded with a masked load, operations would be performed, and then the answers written with a masked store. The “masks” hide the excess elements, so that we don’t read or write from memory Julia does not have access to (which could trigger a segmentation fault).
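A sketch of that remainder arithmetic (matching the mask line in the expansion above; the variable names are purely illustrative):

# 106 iterations with 16-wide vectors: 6 full vector passes plus a masked tail.
n_iter, W = 106, 16
full, tail = divrem(n_iter, W)           # full = 6, tail = 10
mask = (UInt16(1) << tail) - UInt16(1)   # 0x03ff: only the low 10 lanes are active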

Additionally, because we are forcing vectorization, we don’t need the @fastmath macro. Actually, that macro currently makes the code slower here. I’ll fix that sometime.

This macro is a final hammer that can force vectorization.
Other instances where it helps are loops containing functions like log, exp, or sin, which (unlike sqrt) aren’t natively supported by most CPUs.