Before we try to hand-optimize the Gibbs model, let’s talk some about how to actually optimize code. Or at least hit a few basics.

Here I wanted to write a short introduction to writing fast code.

For code to run fast, there are a few basic principles:

- More efficient algorithm. This is where the biggest wins are possible!
- Don’t compute anything you don’t have to. Don’t compute anything more than once (unless it’s faster/cheaper to re-compute than to store and look up).
- Parallelism. CPU clock speeds have largely stopped increasing in recent years. Instead, recent increases in CPU throughput have come in the form of parallelism. Parallelism takes many forms, the most obvious which is multiple separate CPU cores. Another major source of speed up is SIMD (Single Instruction Multiple Data), where a single instruction from one CPU core can act on multiple pieces of data simultaneously. Most recent CPUs have 256 bit registers (where the data is held while the CPU can operate on it). A standard double precision number is 64 bits, meaning these CPUs can potentially operate on 4 numbers at a time. A few CPUs have already upgraded to 512 bit registers, potentially acting on 8 double precision numbers or 16 single precision numbers per opertion.
- Memory is slow! Consider this benchmark:

```
using BenchmarkTools, SIMDArrays
A = @Static randn(8,13)
B = @Static randn(13,10)
@benchmark foreach(_ -> $A * $B, 1:1000)
```

```
# 299_792_458 speed of light in m/s * 25 ns * 1 s / 1e9 ns = meters light travels
299_792_458 * 19 * 1 / 1e9
```

I repeated the matrix multiplication benchmark 1000 times to get more accurate timings.

Light travels 299,792,458 meters per second. That means that light can only travel about 5.7 meters in the time it takes to multiply an $8 \times 12$ and a $12 \times 10$ matrix.

That computation requires taking $8 * 10 = 80$ dot products with vectors of length 13. Each of these dot products requires 13 multiplications and 12 additions. That means that a single CPU core did 1040 multiplications and 960 additions in the time it took *light* to travel 5.7 meters, less than the width of a typical office.

Each core of a computer can churn through data blisteringly fast.

The CPU’s speed is battling the laws of physics. But memory is much slower. Figuring out where the memory actually is that we’d like to access, and moving it to the CPU is extremely slow compared to how fast the CPU is at handling memory that is actually there. RAM can be far away.

An example of where this can be a problem, consider a simple dot product.

```
using LinearAlgebra
BLAS.set_num_threads(1)
x = rand(128); y = rand(128);
@btime foreach(_ -> $x'* $y, 1:10^6)
```

```
x = rand(128 * 10^6); y = rand(128 * 10^6);
@btime $x'* $y
```

```
156.32 / 14.843
```

Multiplying the length of a vector by a million makes the dot product more than 10 million times slower!

More or less, that means for every microsecond the CPU spent working, it spent at least 9 microseconds just waiting for data to show up!

I won’t focus on this much in this post, but it’s something to be mindful of while trying to write fast code.

Try not to jump around in memory much. CPUs have different levels of cache memory. These caches are much smaller than RAM, but they’re much faster. Typically, L3 is large and shared among cores, while each core gets their own L2 and L1 cache. The L1 cache is the smallest, but also the closest to the action and the fastest.

On my computer, each core has 32 KiB of L1 data cache, meaning it can store roughly 32 KiB * 1024 B/KiB / (8 B / 64-bit float) = 4096 64 bit floats in the L1 cache. Breaking bigger problems down and processing them in chunks small enough to fit in the L1 cache, eg using less than 4096 doubles at a time, will be faster.

Many optimized libraries and aglorithms do this. For example, optimized matrix multiplication routines do this for via multiplying big matrices block-wise, where these blocks are small enough to fit in the L1 cache.

My focus here will be on the SIMD (Single Instruction Multiple Data) instructions. This is also often known as vectorization, because each instruction operates on a “vector” of data. This can be confusing for those coming from `R`

, `MATLAB`

, or `Python`

, where “vectorization” often means “not using a for loop”.

Vectorizing in languages with great compilers, such as `Julia`

, `C++`

, or `Fortran`

, means having the compiled generate SIMD instructions.

Many “vectorized” functions from the former sets of languages will also often really be vectorized in the sense of the latter, because they’re likely to be calling compiled `C`

/`C++`

/`Fortran`

code.

Within the compiled languages, “for loops” often are vectorized. For example,

```
function my_dot_product(a, b)
d = 0.0
@inbounds @simd for i ∈ eachindex(a, b)
d += a[i] * b[i]
end
d
end
x = rand(128); y = rand(128)
@btime $x' * $y
```

```
@btime my_dot_product($x, $y)
```

We get the same answer as the “builtin” bar rounding error, but our function is actually faster! That’s because the base dot product actually calls an external BLAS library instead of native Julia code. The advantage of the external library is that it is multithreaded (so long as we don’t set `BLAS.set_num_threads(1)`

), which speeds it up for really big vectors. Memory throughput should be higher when there’s more cores.

For vectors too small to benefit, it’s faster not to call an external library.

We can look at the assembly instructions of `my_dot_product`

with `@code_native`

.

When looking for vecorization, the things to look for are vector registers. `xmm`

registers are 128 bits, `ymm`

registers are 256 bits, and `zmm`

registers are 512 bits. An instruction will often look like this:

```
vfmadd231pd (%rdx,%rdi,8), %zmm4, %zmm0
```

the instruction is named `vfmadd231`

. The `fmadd`

says what it is doing. `fmadd`

means fused-multiply-add, meaning it is doing both an addition and a multiplication. `add`

does addition, `sub`

subtraction, `mul`

multiplication, `div`

division, etc. `mov`

means to move memory from one place to another.

The second last letter, `p`

, means “packed”, meaning that it is operating on a vector “packed” with numbers. It could alternatively say `s`

, which means it is only operating on a single number, even if the register could hold an entire vector.

The `d`

means “double precision”, meaning the operations is multiplying and adding double precision (64 bit) floating point numbers. Remember that the `zmm`

register holds 512 bits — that’s just 512 zeros/ones. That last letter of the instruction specifies how exactly those zeros and ones are being interpreted. `q`

means integer, and `s`

means single precision.

This instruction has 3 inputs (but many have only 1 or 2): it is multiplying numbers from a region in memory `(%rdx,%rdi,8)`

with the register `zmm4`

, and adding the result to `zmm0`

.

In AT&T syntax (which this is), the answer gets stored on the right most of the registers that follow, meaning `zmm0`

gets overwritten with the sum of itself, and the product of `(%rdx,%rdi,8)`

and `%zmm4`

. That is:

`%zmm0 = (%rdx,%rdi,8) * %zmm4 + %zmm0`

Because `zmm`

registers hold 8 doubles, this single instruction does 8 multiplications and 8 additions.

Without further adieu:

```
@code_native my_dot_product(x, y)
```

That is a lot of output, but the most important part:

```
L128:
vmovupd (%rcx,%rdi,8), %zmm4
vmovupd 64(%rcx,%rdi,8), %zmm5
vmovupd 128(%rcx,%rdi,8), %zmm6
vmovupd 192(%rcx,%rdi,8), %zmm7
; ││└
; ││┌ @ float.jl:395 within `+'
vfmadd231pd (%rdx,%rdi,8), %zmm4, %zmm0
vfmadd231pd 64(%rdx,%rdi,8), %zmm5, %zmm1
vfmadd231pd 128(%rdx,%rdi,8), %zmm6, %zmm2
vfmadd231pd 192(%rdx,%rdi,8), %zmm7, %zmm3
; │└└
; │┌ @ int.jl:53 within `macro expansion'
addq $32, %rdi
cmpq %rdi, %rsi
jne L128
```

The `vmovupd`

instruction loads data from memory into a register. Once these are loaded, it uses the `vfmadd`

instructions to multiply them with more numbers from a register, and update cumulative sums.

It does each of these things four times total. Meaning it loads does 32 multiplications and 32 additions.

Then it increments a loop counter (`\%rdi`

) with `addq`

, then `cmpq`

(compare-integers) to check if the loop is over, and finally conditionally jumps (`jne`

) back to `L128`

(location marked 128) or continues on in the code. `L128`

is the start of the loop, just before memory was loaded. Meaning if it jumps back, the loop, loops.

So hold on. We wrote a `for`

loop, where each iteration we said get 1 element from both arrays, multiply the two of tem, and add them to the cumulative sum. But this assembly code is instead saying, do that to 32 numbers per loop iteration!

That’s not what our code said!

Tricks like this are a part of why languages like `C++`

are fast: the compiler sees what you wrote, but then tells the computer to do something that will get the same answer, but is much faster.

Actually, that is not quite true here. Floating point arithmetic is not associative. That means that $(a + b) + c$ does not not necessarilly equal $a + (b + c)$, due to rounding errors.

For example:

```
1e18 + 65.0 - 1e18 - 65.0
```

```
foo(a, b) = a + b - a - b
foo(1e18, 65)
```

```
foo(1e18, 63)
```

You’re likely to get the same result in any language. Why?

Because the machine error at a number as massive as `1e18`

is 128.

```
eps(1e18)
```

This means that the difference between `1e18`

and the next two closest numbers 64 bit floats can represent is 128.

```
nextfloat(1e18) - 1e18
```

```
1e18 - prevfloat(1e18)
```

Therefore, `1e18 + 65`

rounds to `1e18 + 128`

. Then when we subtract `1e18`

, we get 128. Subtract 65, and we have 63.

Similarly, `1e18 + 63`

rounds to `1e18`

. Subtract `1e18`

, and we have 0. Subtract 63, and we have -63.

Because associativity is not true for floating point numbers, compilers will not, by default, apply any associative-transforms to reorder floating point math. When in doubt, it is safer to assume you meant exactly what you wrote in the way you wrote it.

It would also be confusing if turning on compiler optimizations by itself suddenly changed the answer you got! Rounding errors are not always trivial. $\pm 63$ is not trivial!

An example of an algorithm written with the (lack of) associativity in mind, see Kahan summation. Kahan summation is a numerically accurate summation algorithm. It works by, while summing up your vector, also calculating and tracking the numerical erorr of your sum, and repeatedly adding this numerical eror back onto the sum to correct it.

If a compiler were allowed to assume that floating point math is associative, it would realize that the numerical error is (given associativity!) zero, and therefore delete all your code handling error. Then, suddenly, your accurate summation algorithm is a regular summation algorithm.

In our dot product, however, the compiler did assume associativity, and reorder our computations.

It didn’t calculate one product, add it to a single cumulative sum, and repeat. It calculated 32 products at a time, accumulating 32 separate sums. And only at the end, did it finally add these 32 sums together (plus add a second loop to cover any straglers, because most vectors’ length’s aren’t a product of 32).

While that method is actually more accurate than simply doing it in order with a single sum. But it isn’t what we wrote, and the compiler can’t know on it’s own that we were not intending to do something fancy like Kahan summation.

So we have to give the compiler permission to optimize, telling it that isn’t what we were doing. That is what the `@simd`

in front of our `for`

loop does.

Another macro that does this and applies to more than `for`

loops is `@fastmath`

:

```
fast_foo(a, b) = @fastmath a + b - a - b
fast_foo(1e18, 65.0)
```

```
@code_native foo(1e18, 65.0)
```

```
@code_native fast_foo(1e18, 65.0)
```

the `vxorps`

instruction does the `xor`

operation. `xor`

means, “or, but not and”. `xor`

-ing anything with itself always equals zero because everything is “and” themselves. So while `foo`

takes us literally and does exactly what we asked for, fast_foo did the algebra, relealized everything cancels out, and realized all it has to do is spit out “0”.

Also, integer arithmetic, unlike floating point arithmetic, *is* associative. So no promises needed:

```
@code_native foo(10^18, 65)
```

Our regular, non-fastmath foo, when compiling a version for integers, also just `xor`

s to spit out a zero.

Another reason these compiled languages are faster is “2.” from our list: not doing anything you don’t have to, and just doing the things you need to once. eg, sorting out all the administrative details of what the input data is, and then what happens to it.

Figuring out what to do on the fly can take many times longer than just doing it, if “it” is fast.

In interpreted languages, like `R`

and `Python`

, each variable is a reference to data. Then when you want to do something with those variables, the interpreter has to “interpret” the meanings of everything as it happens, figuring out what the variables are, what it’s supposed to do with them, etc. We can emulate this from Julia like this:

```
x = Ref(1.0); y = Ref(1.0)
@btime x[] + y[]
```

The most important factor of the above benchmark was leaving off the `$`

I used earlier each time I benchamrked. The `$`

lets the benchmark know the types of the variables I am benchmarking. This would let it emulate calling the benchmarked expression from within a function. Inside a function, Julia should (normally) be aware of the types of all the variables, which lets it sort out the sequence of computations it actually has to do ahead of time.

Now, 30-something nanoseconds isn’t bad on its own.

If R spends a 100ns interpreting things when you call `rstan`

, who cares?

If all you’re doing is calling big functions or “vectorized” routines written in other languages, performance is fine. Actually, if that is all you’re doing, the interpreter is likely to spend a lot less time figuring out which already-compiled C code it should be running than Julia will compiling code before it gets to run it.

Try comparing how long it takes a fresh R and Julia session to load a plotting package and spit out the first plot.

The problem becomes apparent when you want to write fast routines yourself, building them out of lots of little instructions.

Imagine if, in our matrix multiplication, our 1040 multiplications and 960 additions each took 30+ ns. We certainly couldn’t have done it all in less than 20 ns!

Or, see what happens to a for loop:

```
x = rand(128); y = rand(128)
@btime (d = 0; for i ∈ eachindex(x, y); d += x[i] * y[i]; end; d)
```

```
@btime my_dot_product($x, $y)
```

Scaling this up, you can imagine: if some Julia code takes 1 hour to run, it could take a month to run interpreted.

In cases like that, who cares about compiling for a second?

This can be important if you’d like to run length simulations, without being constrained to using tools that already exist. How often is the best tool for the job something that’s already there and available?

Of course, other tools like `Rcpp`

exist to do this from within `R`

, and popular packages like `JAGS`

/`NIMBLE`

and `rstan`

will translate a DSL (domain-specific language) to `C++`

and compile it.

There are a few tricks to keep in mind when trying to get code to vectorize.

So let’s walk through an example I had fun with:

```
@inline function pdforwardsolve(x1,x2,x3,S11,S12,S22,S13,S23,S33)
@fastmath begin
R11 = 1 / sqrt(S11)
R11x1 = R11 * x1
L21 = R11 * S12
L31 = R11 * S13
R22 = 1 / sqrt(S22 - L21*L21)
L32 = R22 * (S23 - L21 * L31)
R33 = 1 / sqrt(S33 - L31*L31 - L32*L32)
nR21x1 = R22 * L21 * R11x1
R31x1 = R33 * ( L32*nR21x1 - L31*R11x1 )
nR32 = R33 * L32 * R22
(
R11x1,
R22*x2 - nR21x1,
R31x1 - nR32*x2 + R33*x3
)
end
end
```

This function calculates:

\begin{align*}

\textbf{L} &= \text{chol}(\textbf{S})\\

\textbf{y} &= \textbf{L}^{-1}\textbf{x},

\end{align*}

where $\textbf{S}$ is a $3\times 3$ symmetric and positive definite matrix.

Our data consists of a large number of vectors $\textbf{x}$ and matrices $\textbf{S}$, and we want to solve for $\textbf{y}$ for each of them. So we want a function to calculate this for every element.

The most important trick to remember for vectorizing code is that CPUs can efficiently load or operate on contiguous data, and “under the hood”, all data is stored linearly as a vector. That means in a matrix like:

```
reshape(collect(1:12), (4,3))
```

The 4 and 5 are next to each other in memory, while the 1 and 5, despite appearing next to each other in the matrix, are 4 appart. This is called column-major storage, because data within a column are next to one another.

Row-major storage is the opposite: data within a row are next to each other.

So, while trying to get a function to vectorize, we need to ask:

- What pattern would let us do multiple computations simultaneously?
- How then will we have to organize our data to let that happen?

Now, the standard advice you see online is, because `pdforwardsolve_fm`

operates on one vector and one matrix at a time, we should store our data so that the data in each of these are close together. For example:

Accessing the memory in the order in which it is stored will be faster. In Fortran, this means writing loops like

do j = 1, n do i = 1, n ! operate on A(i,j) ... end do end doOn the other hand, in C we want:

for (i = 0; i < n; i++ { for (j = 0; j < n; j++) { /* operate on A[i][j] ... */ } }

Or if you want something official looking, from MATLAB’s documentation:

Code segment 2 is faster because it traverses the elements of the 2-D array by going down the columns in the inner loop. If the algorithm permits, you can also maximize cache efficiency by using the single index method, x(k), instead of x(r,c).

So, lets try that. First, some code to generate a fake data set.

```
using StaticArrays, Random, BenchmarkTools
N = 2827
Y32 = Matrix{Float32}(undef, N, 3);
BPP32 = Matrix{Float32}(undef, N, 10);
@inline function randpd(::Type{T} = Float64, ::Val{N} = Val(3), ::Val{R} = Val(N)) where {T,N,R}
(@SMatrix randn(T, 2N, N)) |> x -> x' * x;
end
@inline function randpdv(::Type{T}) where T
sS = randpd(T, Val(3))
@inbounds SVector(sS[1], sS[4], sS[5], sS[7], sS[8], sS[9])
end
function fillpd!(X::AbstractMatrix{T}) where T
@inbounds for i ∈ 1:size(X,1)
X[i,:] = randpdv(T)
end
end
randn!(@view BPP32[:,1:4]);
fillpd!(@view BPP32[:,5:10])
BPP = copy(BPP32')
Y = Matrix{Float32}(undef, 3, N)
function juliatest_1!(Y::AbstractMatrix, BPP::AbstractMatrix)
@inbounds for i ∈ 1:size(BPP,2)
Y[:,i] .= pdforwardsolve(
BPP[1,i],BPP[2,i],BPP[3,i],
BPP[5,i],BPP[6,i],BPP[7,i],BPP[8,i],BPP[9,i],BPP[10,i]
)
end
end
@benchmark juliatest_1!($Y, $BPP)
```

Is 19 microseconds good? Hard to know without a baseline, but let’s take a quick glance at the assembly.

```
@code_native juliatest_1!(Y, BPP)
```

All we see are `xmm`

registers accompanied with instructions that have an “s” in the second-last position, meaning “single” element, not “packed” vector.

So the code is not vectorized at all. For good measure, let’s add an `@simd ivdep`

. The `ivdep`

makes a few promises to the compiler that each iteration of the for loop is independent $-$ that one iteration will not dependent on the result from a previous one.

```
function juliatest_2!(Y::AbstractMatrix, BPP::AbstractMatrix)
@inbounds @simd ivdep for i ∈ 1:size(BPP,2)
Y[:,i] .= pdforwardsolve(
BPP[1,i],BPP[2,i],BPP[3,i],
BPP[5,i],BPP[6,i],BPP[7,i],BPP[8,i],BPP[9,i],BPP[10,i]
)
end
end
@benchmark juliatest_2!($Y, $BPP)
```

No luck.

In this case, we can look at `pdforwardsolve`

, and see that while there is a lot of math, many of those calculations within the function depend on earlier answers. If we need one answer before getting to calculate the next, we can’t calculate them at the same time!

That more or less means that calculating one instance of `pdforwardsolve`

cannot be vectorized.

But, wait!

What about instead just calculating a bunch of `pdforwardsolve`

at in parallel, at the same time (on a single core)?

So, let’s try swapping the indices, and using the transposed versions of $\textbf{Z}$ and $\textbf{BPP}$:

```
function juliatest_3!(Y::AbstractMatrix, BPP::AbstractMatrix)
@inbounds @simd ivdep for i ∈ 1:size(BPP,1)
Y[i,:] .= pdforwardsolve(
BPP[i,1],BPP[i,2],BPP[i,3],
BPP[i,5],BPP[i,6],BPP[i,7],BPP[i,8],BPP[i,9],BPP[i,10]
)
end
end
@benchmark juliatest_3!($Y32, $BPP32)
```

Worse. This part of optimization can be tricky: you know what you want the compiler to do (vectorize), we know that it is possible, (why wouldn’t it be?), but the compiler isn’t cooperating.

The problem is that Julia gets confused. I think what is happening is that, with `ivdep`

, we promised that each of the arrays are distinct, and don’t overlap. That is, that Z and BPP are separate.

But, we have a whole lot of referenecs to the same BPP, and we’re writing into a slice of Z. So the compiler isn’t convinced that these don’t overlap or conflict in some way.

So let’s try and be a little more explicit.

Instead of pointing to a bunch of different indices within the same two arrays, we’re taking a view of each of the columns of these arrays. A view, like the name implies, is just a look at the parent. Meaning editing `Y1`

edits the first column of `Y`

, because `Y1`

*is* the first column of `Y`

.

But splitting everything appart in this way makes it more obvious that all these columns are in fact, separate.

```
function juliatest_4!(Y::AbstractMatrix{T}, Data::AbstractMatrix{T}) where T
@inbounds begin
Y1, Y2, Y3 = @views Y[:,1], Y[:,2], Y[:,3]
X1, X2, X3 = @views Data[:,1], Data[:,2], Data[:,3]
S_11, S_12, S_22 = @views Data[:,5], Data[:,6], Data[:,7]
S_13, S_23, S_33 = @views Data[:,8], Data[:,9], Data[:,10]
end
@inbounds @simd ivdep for i ∈ 1:size(Data,1)
(y1, y2, y3) = pdforwardsolve(
X1[i], X2[i], X3[i],
S_11[i], S_12[i], S_22[i], S_13[i], S_23[i], S_33[i]
)
Y1[i] = y1
Y2[i] = y2
Y3[i] = y3
end
end
@benchmark juliatest_4!($Y32, $BPP32)
```

So, now the compiler is suddenly confident enough to vectorize the code, and it’s 6 times faster! Nice.

Before seeing if we can make this even faster, I wanted to comment on something else necessary:

To vectorize, you have to do the same thing to all elements.

That means every “branch” where code could take more than one possible path, like in an “if, then” statement, causes a problem. It is possible to get around this, but it causes problems, and will often stop vectorization.

That might not seem like a big deal $-$ our code didn’t have any obvious branches, right?

Well…

```
sqrt(2.3)
```

```
sqrt(-2.3)
```

The code just did one of two different things.

It either returned a number, or through an error.

Or:

```
Y32[1000,1]
```

```
Y32[10000,1]
```

Again, one of two different things. These are branches, either of which can kill vectorization.

But the square root is especially insidious. For the bounds check, compilers are often smart enough to realize that if the very first and very last bounds check are okay (or maybe only the last), then all of them must be okay. If they notice that, they’ll only make the single check, and everything is okay.

Alternatively, `@inbounds`

tells the compiler not to check the boundaries of the array, and just trust everything is within bounds.

The `sqrt`

is harder, because how is the compiler supposed to know anything about the numbers that you’re sending in to that function? There isn’t a simple precheck it can use to figure out if things are going to be okay.

A solution there is `@fastmath`

, which tells the compiler (among other things) not to worry about checking square roots:

```
@fastmath sqrt(-2.3)
```

No branch, instead it returns the double precision number `NaN`

```
typeof(NaN)
```

otherwise, acting exactly as it normally does.

While we want both `@inbounds`

and `@fastmath`

while optimizing code, we’d prefer neither while debugging.

Errors make it easier to track down where a problem came from.

`sqrt`

s silently introducing infectious `NaN`

s can be unpleasant to track down. Even worse, Julia is likely to outright crash with a segmentation fault from `@inbounds`

, when the memory really was not inbounds. Note that `@inbounds`

only works inside of functions, so you can’t just try `@inbounds [1,2][1000]`

in the REPL.

These are tradeoffs and considerations you have to consider in every language. For example, in `C++`

, array boundaries are not checked by default $-$ you’d just get the crash!

Finally, the `@inline`

in front of the definition of `pdforwardsolve`

tells the compiler that when you write another function calling `pdforwardsolve`

, instead of actually calling a compiled `pdforwardsolve`

, you want to insert the code. The compiler will often decide to do that automatically, even without `@inline`

, but it is important to make sure that is happening when you vectorize code.

This is because, for the for loop to get vectorized, the compiler has to know what it is vectorizing, and generate that vectorized code. If all it has to work with is an opaque “call some other function named `pdforwardsolve`

“, the compiler wouldn’t know what to do with that.

I had written this code in a few different languages, including `C++`

, `Fortran`

, `Rust`

, and `ispc`

.

Anyway, `C++`

code doing the same thing:

```
#include <math.h>
inline void fpdforwardsolve(float* __restrict Z, float* __restrict BPP, int n, int N){
float x1 = BPP[n ];
float x2 = BPP[n + N];
float x3 = BPP[n + 2*N];
float S11 = BPP[n + 4*N];
float S12 = BPP[n + 5*N];
float S22 = BPP[n + 6*N];
float S13 = BPP[n + 7*N];
float S23 = BPP[n + 8*N];
float S33 = BPP[n + 9*N];
float R11 = 1 / sqrtf(S11);
float R11x1 = R11 * x1;
float L21 = R11 * S12;
float L31 = R11 * S13;
float R22 = 1 / sqrtf(S22 - L21*L21);
float L32 = R22 * (S23 - L21*L31);
float R33 = 1 / sqrtf(S33 - L31*L31 - L32*L32);
float nR21x1 = R22 * L21*R11x1;
float R31x1 = R33 * ( L32*nR21x1 - L31*R11x1 );
float nR32 = R33 * L32 * R22;
Z[n ] = R11x1;
Z[n + N] = R22*x2 - nR21x1;
Z[n + 2*N] = R31x1 - nR32*x2 + R33*x3;
}
extern "C" {
void processBPP(float* __restrict Z, float* __restrict BPP, int N){
for (int n = 0; n < N; n++){
fpdforwardsolve(Z, BPP, n, N);
}
}
}
```

If you want to treat it as `C`

code:

- Get rid of the
`extern "C" { }`

surrounding the function`processBPP`

. That line tells a C++ compiler to externally represent that function as though it were`C`

.`C`

automatically externally represents its functions as though they were`C`

. - Delete the double underscores in front of the
`__restrict`

s, replacing them with a simply`restrict`

.

I compiled this code with the following two compiler commands, to test two different compilers:

`g++ -Ofast -march=native -shared -fPIC -mprefer-vector-width=512 -o libgppvectorization_test.so vectorization_test.cpp`

and

`clang++ -Ofast -march=native -shared -fPIC -mprefer-vector-width=512 -o libclangvectorization_test.so vectorization_test.cpp`

Run down of the commands:

`g++`

/`clang++`

: the name of the compilers.`g++`

is from the GNU compiler collection, while`clang`

is an “LLVM”-based C++ compiler. The LLVM project is a collection of tools meant to be modular. To get an idea of the aims and goals of LLVM, the first option under LLVM Tutorial is a tutorial for how to create your own language using LLVM. Surprise, surprise: Julia is also built on LLVM. As are languages like Swift, Rust, and ispc.`-Ofast`

: this turns on a lot of optimizations, as well as`@fastmath`

, turning off errors from math functions, and treats floating point math as though it were associative.`-march=native`

: optimize the code to run fast on the computer compiling the code. And it is okay if the code cannot run on other computer. If you plan on letting people download and run your compiled code, you can’t use this.`-shared -fPIC`

we want a shared library (requires`-fPIC`

). Languages like Julia and R can call compiled shared libraries (`.dll`

on Windows,`.dylib`

on Macs, and`.so`

on Linux).`-mprefer-vector-width=512`

gcc in particular doesn’t like using 512-bit vectors unless asked. LLVM is better about this. Don’t use this if your computer doesn’t support 512-bit vectors.`-o libclangvectorization_test.so`

name of the output shared library we are creating. Change the suffix based on your operating system.`vectorization_test.cpp`

the name of the file I saved the C++ code as.

To call the `C++`

code from Julia, all we need is what follows. We test to make sure we get the same answers as from Julia before we benchmark, because getting the right answer is more important than speed.

```
const cpptest_dir = "/home/chriselrod/Documents/progwork/julia/probcol/processBPP_tests"
const clanglib = joinpath(cpptest_dir, "libclangvectorization_test.so")
const gpplib = joinpath(cpptest_dir, "libgppvectorization_test.so")
function gpptest(Y, BPP, N)
ccall((:processBPP, gpplib), Cvoid, (Ptr{Float32}, Ptr{Float32}, Int32), Y, BPP, N[])
end
function clangtest(Y, BPP, N)
ccall((:processBPP, clanglib), Cvoid, (Ptr{Float32}, Ptr{Float32}, Int32), Y, BPP, N[])
end
N32 = Int32(N)
fill!(Y, 0); juliatest_1!(Y, BPP); Y
```

```
fill!(Y32, 0); juliatest_4!(Y32, BPP32); Y32'
```

```
fill!(Y32, 0); gpptest(Y32, BPP32, N32); Y32'
```

```
fill!(Y32, 0); clangtest(Y32, BPP32, N32); Y32'
```

```
@benchmark gpptest($Y32, $BPP32, $N32)
```

```
@benchmark clangtest($Y32, $BPP32, $N32)
```

Wow! Clang smoked `Julia`

and `g++`

. What is going on?

Looking at the assembly reveals a trick both `g++`

and `Clang`

are using, but Julia is not.

Clang is using LLVM version 7, and Julia is LLVM version 6. Once `Julia`

upgrades, we should see it closer `Clang`

.

The trick is in our square roots, eg:

` float R11 = 1 / sqrtf(S11);`

Julia is calculating the vector of `R11`

values by taking the square root of a vector of `S11`

values, and then dividing a vector of 1s by this vector of square roots.

However, the the CPU I am running this code on has an innacurate instruction called `vrsqrt14ps`

. This instruction calculates an approximation to the recirprical of a square root, and is extremely fast. Unfortunately, it is just an approximation, with relative error of $10^{-14}$. So instead of using this value directly, it performs a Newton-Raphson iteration to cut down the error:

`rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)`

Now, I said both `g++`

and `Clang`

use this trick. Then why was `g++`

slower?

Because of a bug, where for my CPU the compiler doesn’t realize it can use this `rsqrt`

approximation. Instead, it does the same thing `Clang`

does to approximate the reciprocal square root…and then multiply that by the original number, to approximate the square root, and then it inverts that approximation to the square root using some other convoluted scheme. Yikes.

If we instead compile with:

`g++ -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512 -o libgppvectorization_test2.so vectorization_test.cpp`

we don’t give it permission to approximate that square root calculation (while still turning off errors, to let the code be vectorized):

```
const gpplib2 = joinpath(cpptest_dir, "libgppvectorization_test2.so")
function gpptest2(Y, BPP, N)
ccall((:processBPP, gpplib2), Cvoid, (Ptr{Float32}, Ptr{Float32}, Int32), Y, BPP, N[])
end
fill!(Y32, 0); gpptest2(Y32, BPP32, N32); Y32'
```

```
@benchmark gpptest2($Y32, $BPP32, $N32)
```

So now it is the same fast as Julia, because it’s doing the same thing.

But hmm. We just realized that Clang (thanks to LLVM 7 $-$ LLVM 6 didn’t do that!) is faster than Julia. What can we do?

One choice is to wait until Julia upgrades to LLVM 7. Nah.

Another is to try and compile Julia from source to use LLVM 7. I don’t know enough about Julia’s internals to upgrade it myself. Julia needs a bunch of patches, so it requires a whole lot more work than just telling it to download a different version while building.

So, what does that leave us with?

Well, for starters, we can write our own `rsqrt`

function. In my library SIMDPirates, I have:

```
@inline function rsqrt_fast(x::NTuple{16,Core.VecElement{Float32}})
Base.llvmcall(
""" %rs = call <16 x float> asm "vrsqrt14ps \$1, \$0", "=x,x"(<16 x float> %0)
ret <16 x float> %rs""",
NTuple{16,Core.VecElement{Float32}}, Tuple{NTuple{16,Core.VecElement{Float32}}}, x)
end
@inline function rsqrt(x::NTuple{16,Core.VecElement{Float32}})
r = rsqrt_fast(x)
# Performs a Newton step to increase accuracy.
ns = vfma(vmul(r,r), x, -3.0f0)
vmul(vmul(-0.5f0, r), ns)
end
@inline rsqrt_fast(x::AbstractStructVec) = SVec(rsqrt_fast(extract_data(x)))
@inline rsqrt(x::AbstractStructVec) = SVec(rsqrt(extract_data(x)))
@inline rsqrt_fast(x) = @fastmath inv(sqrt(x))
@inline rsqrt(x) = @fastmath inv(sqrt(x))
```

`rsqrt_fast`

calls assembly (asm) instruction `vrsqrt14ps`

, and then `rsqrt`

calls this and adds the Newton Raphson iteratin. Nice!

But, then what about actually vectorizing a for loop? This function would have to be called with an actual vector of 16 32-bit floats to use that instruction. (And what about other functions where it is hard to convince the compiler to vectorize them?)

Well, I wrote a library LoopVectorization that forcably vectorizes some kinds of `for`

loops. Trying it out here:

```
using SIMDPirates, LoopVectorization
@inline function pdforwardsolve_nofm(x1,x2,x3,S11,S12,S22,S13,S23,S33)
R11 = rsqrt(S11)
R11x1 = R11 * x1
L21 = R11 * S12
L31 = R11 * S13
R22 = rsqrt(S22 - L21*L21)
L32 = R22 * (S23 - L21 * L31)
R33 = rsqrt(S33 - L31*L31 - L32*L32)
nR21x1 = R22 * L21 * R11x1
R31x1 = R33 * ( L32*nR21x1 - L31*R11x1 )
nR32 = R33 * L32 * R22
(
R11x1,
R22*x2 - nR21x1,
R31x1 - nR32*x2 + R33*x3
)
end
@generated function juliatest!(X::AbstractMatrix{T}, BPP::AbstractMatrix{T}) where T
quote
@vectorize $T for i ∈ 1:size(BPP,1)
X[i,:] .= pdforwardsolve_nofm(
BPP[i,1],BPP[i,2],BPP[i,3],
BPP[i,5],BPP[i,6],BPP[i,7],BPP[i,8],BPP[i,9],BPP[i,10]
)
end
end
end
fill!(Y32, 0); juliatest!(Y32, BPP32); Y32'
```

```
@benchmark juliatest!($Y32, $BPP32)
```

Not quite as good as Clang, but close!

We can use the macro `@macroexpand`

to expand the `@vectorize`

macro and see what it does. We’ll also use `striplines`

from the MacroTools library, because this stripes out a bunch of debugging information that clutters up the expansion.

```
using MacroTools: striplines
(@macroexpand @vectorize Float32 for i ∈ 1:size(BPP,1)
X[i,:] .= pdforwardsolve(
BPP[i,1],BPP[i,2],BPP[i,3],
BPP[i,5],BPP[i,6],BPP[i,7],BPP[i,8],BPP[i,9],BPP[i,10]
)
end) |> striplines
```

Odds are everything isn’t suddenly clear at a glance.

But what is happening is that it takes our for loop, and unrolls it, so that it can handle many iterations at a time with vectors. It also takes care of loading and storing these vectors from vectorizable forms of our input arrays.

Finally, because not all numbers are divisible by [length of the vectors], it handles the remaining iterations with a masked-operations. Meaning, if the for loop covered 106 iterations and the vectors were of length 16:

- The for loop would repeat 6 times.
- The last 10 elements would be loaded with a masked load, operations would be performed, and then the answers written with a masked store. The “masks” hide the excess elements, so that we don’t read or write from memory Julia does not have access to (which could trigger a segmentation fault).

Additionally, in forcing vectorization, we don’t need the `@fastmath`

flag. Actually, that flag currently makes the code slower. I’ll fix that sometime.

This macro is a final hammer that can force vectorization.

Other instances where it works is when loops contain functions like `log`

, `exp`

, or `sin`

, which (unlike `sqrt`

) aren’t natively supported on most CPUs.