Now, let’s revisit the Gibbs sampler, and see how much faster we can get it.

My Julia code depends on a lot of packages that I’ve written or adapted from others. They also assume they can find each other in Julia’s package-development folder, so you have to install them by saying you wish to develop them. From the Julia REPL:

```
] dev https://github.com/chriselrod/VectorizationBase.jl https://github.com/chriselrod/SIMDPirates.jl https://github.com/chriselrod/SLEEF.jl https://github.com/chriselrod/VectorizedRNG.jl https://github.com/chriselrod/ScatteredArrays.jl
; cd ~/.julia/dev/SLEEF
; git fetch origin
; git checkout -b SIMDPirates remotes/origin/SIMDPirates
] dev https://github.com/chriselrod/DifferentiableObjects.jl https://github.com/chriselrod/KernelDensityDistributionEsimates.jl https://github.com/chriselrod/LoopVectorization.jl https://github.com/chriselrod/CovarianceRealism.jl
```

On Windows, it is probably `; cd ~\.julia\dev\SLEEF` to navigate to that folder.

The `]` prefix means to enter Julia's package mode, which you do by hitting the `]` key at the REPL. Similarly, the `;` prefix means to enter the command-line (shell) mode, entered by hitting `;`.

I could write a fair bit about most of these libraries. (I should certainly write documentation for all of them…)

I’ll derail myself briefly just to talk about VectorizedRNG.

For comparison, RcppZiggurat discusses a variety of methods of generating random normals. The Ziggurat method is the second fastest but, unlike the fastest, has good statistical properties. Julia's default `randn()` uses the Ziggurat method, as of course does the R library `RcppZiggurat`.

The method folks are likely most familiar with is the well-known Box-Muller transform, discussed, for example, in Casella & Berger's *Statistical Inference*. Poor old Box-Muller was found to be:

> We see that the Box-Muller generator is the slowest by some margin

It isn’t exactly clear how wide the margin is, but from their figures, Box-Muller took over 300 ms for a million draws on their test computer, while many of the Ziggurat methods took well under 50 ms.
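For reference, the transform itself is tiny. A minimal scalar sketch (my own toy version, not any library's implementation):

```julia
# Box-Muller: turn two independent uniforms on (0,1) into two
# independent standard normals via a radius and an angle.
function box_muller(u1, u2)
    r = sqrt(-2 * log(u1))  # radius: -log(u1) is Exponential(1)
    θ = 2π * u2             # uniform angle on [0, 2π)
    return r * cos(θ), r * sin(θ)
end

z1, z2 = box_muller(rand(), rand())
```

The `log`, `sqrt`, `cos`, and `sin` calls are exactly why it is slow scalar-by-scalar, and exactly why it vectorizes so well.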

We can easily run R scripts from Julia, without even needing to load the `RCall` library, by taking advantage of the `run` function, which runs shell commands (alternatively, you could hit the semicolon `;` key at the Julia REPL to enter the command-line mode). For example:

```
run(`Rscript -e "library(microbenchmark); microbenchmark(rnorm(1e6))"`)
```

```
run(`Rscript -e "library(microbenchmark); RNGkind(normal.kind = 'Box-Muller'); microbenchmark(rnorm(1e6))"`)
```

```
run(`Rscript -e "library(RcppZiggurat); library(microbenchmark); microbenchmark(zrnorm(1e6))"`)
```

That library certainly is much faster than Box-Muller / `R`'s default. (The default is about as fast as Box-Muller; `R`'s default `normal.kind` is actually inversion, which is similarly slow.) Also, recalling what I wrote earlier about memory bottlenecks for big functions: I want to benchmark the random number generator, not my computer's memory bandwidth.

In code using the RNGs, you can generate smaller numbers of random numbers at a time, processing them in batches, to avoid getting throttled by slow memory. So let's try a vector of length 2000:

```
run(`Rscript -e "library(RcppZiggurat); library(microbenchmark); microbenchmark(zrnorm(2e3))"`)
```

Ah, that is a little faster: 10 microseconds for 2000 draws, whereas 500 times as many took over 5 milliseconds.
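A sketch of the batching idea (the function and batch size here are mine, purely illustrative): instead of filling one huge vector, refill a small cache-resident buffer and consume it as you go.

```julia
using Random

# Consume `n_total` normals in cache-friendly batches, accumulating a
# running sum instead of materializing all of them at once.
function batched_sum_randn(n_total, batch = 2_000)
    buf = Vector{Float64}(undef, batch)  # small buffer that stays hot in cache
    s = 0.0
    for _ in 1:(n_total ÷ batch)
        randn!(buf)  # refill the same buffer in place
        s += sum(buf)
    end
    return s
end
```

The consuming step here is just a `sum`, but in a real sampler it would be the next stage of the computation.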

So, let's try the random number generator I wrote in Julia.

And, for reference, which method did I choose, wanting a fast algorithm? Box-Muller! Uh, not off to a good start.

```
using VectorizedRNG, BenchmarkTools, Random
x = Vector{Float64}(undef, 2000)
@benchmark randn!(VectorizedRNG.GLOBAL_vPCG, $x)
```

Box-Muller strikes back with a vengeance, over 3 times faster instead of 4 times slower!

A quick disclaimer: optimal performance requires the avx512f and avx512dq instruction sets, which make vectorized code much faster. Unfortunately, not many CPUs have them yet, but they will hopefully start becoming mainstream later this year.

If you’d like to check your CPU flags on a Unix-like system, try running something like:

```
run(`grep -m 1 avx512dq /proc/cpuinfo`)
```

This prints the entire flags line. If that line didn't contain `avx512dq`, `grep` wouldn't have printed anything at all.
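If you'd rather stay in Julia, you can scan the flags line yourself. `has_cpu_flag` below is a hypothetical helper written for illustration, not part of any package:

```julia
# Return true if `flag` appears as a token on a cpuinfo-style "flags" line.
function has_cpu_flag(flag::AbstractString, cpuinfo::AbstractString)
    for line in split(cpuinfo, '\n')
        startswith(line, "flags") || continue
        return flag in split(line)  # flags are whitespace-separated tokens
    end
    return false
end

# On Linux you would call it as:
# has_cpu_flag("avx512dq", read("/proc/cpuinfo", String))
```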

We can also benchmark versus Julia’s Ziggurat method, for good measure:

```
@benchmark randn!($x)
```

We can break the speed differences down into the underlying uniform random number generator, and the normal generation.

RcppZiggurat uses the fast KISS generator, which fails the linear-complexity tests of the TestU01 BigCrush suite.

VectorizedRNG uses PCGs, which are statistically robust (they pass BigCrush) and very efficient with avx512f.

Julia’s default uses a MersenneTwister, which is fairly robust, and when vectorized (as Julia is), very fast at generating uniforms:

```
@benchmark rand!($x)
```

The default `rand` cannot be vectorized in a `for` loop, and it also isn't vectorized while generating random normals. So let's use a `for` loop to get a baseline for the amount of time spent generating 2000 random uniforms when sampling from a normal distribution:

```
function loop_rand!(x)
    @inbounds for i ∈ eachindex(x)
        x[i] = rand()
    end
end
@benchmark loop_rand!($x)
```

```
@benchmark rand!(VectorizedRNG.GLOBAL_vPCG, $x)
```

Most of the runtime is clearly spent generating the normals from uniforms, and there Box-Muller is much faster: about 2.3 microseconds to turn uniforms into normals, while the difference between Ziggurat-normals and uniforms in base Julia is over twice that.

I’d expect the case to be similar for RcppZiggurat, which was just a little slower than Julia’s Ziggurat.

Of course, if we’re sampling random numbers, it helps to have fast random number generators.

Digression aside, let’s talk Gibbs sampling!

Refresher of the three samplers:

We can split up our parameters into three blocks:

- The (Inverse-)Wishart parameters $\boldsymbol{\Lambda}$ (the covariances $\boldsymbol{\Sigma}$). Conditioning on the group membership of each $\textbf{y}_n$, the (Inverse-)Wisharts are conjugate with the normal distribution on $\textbf{y}_n$.
- The proportion belonging to each group, $\boldsymbol{\pi}$. Conditioning on the number of observations in each group, the Dirichlet-categorical distribution is conjugate.
- Group memberships. Conditioning on the proportions and Inverse-Wisharts of each group, we can calculate the individual likelihood of each $\textbf{y}_n$ belonging to each group. Categorical distributions are straightforward to sample from.
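Putting the blocks together, one sweep of the sampler looks roughly like this. The three update functions are hypothetical stand-ins (here no-ops, so the skeleton runs), not the library's actual API:

```julia
# One Gibbs sweep, in outline. Each update conditions on the other blocks.
update_wisharts!(state)    = state  # 1. conjugate (Inverse-)Wishart draw, given group memberships
update_proportions!(state) = state  # 2. Dirichlet draw, given group counts
update_groups!(state)      = state  # 3. categorical draw per observation, given 1. and 2.

function gibbs_sweep!(state)
    update_wisharts!(state)
    update_proportions!(state)
    update_groups!(state)
    return state
end

gibbs_sweep!((;))  # placeholder state
```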

Let's look at the code for each of these in CovarianceRealism.jl.


To update the Inverse-Wisharts based on group membership, we have

```
@generated function set_priors!(Λs::AbstractVector{InverseWishart{T}}, ::Val{NG}) where {NG, T}
    quote
        $(Expr(:meta, :inline))
        @inbounds @nexprs $NG g ->
            Λs[g] = InverseWishart(one(T), zero(T), zero(T), one(T), zero(T), one(T), T(3), T($(1/NG))*T(1.5)^($NG-g))
        nothing
    end
end
function calc_Wisharts!(revcholwisharts::AbstractVector{RevCholWishart{T}},
                        cholinvwisharts::AbstractVector{CholInvWishart{T}},
                        invwisharts::AbstractVector{InverseWishart{T}},
                        groups::Groups{NG}, rank1covs::Vector{InverseWishart{T}}) where {T,NG}
    set_priors!(invwisharts, Val(NG))
    for (i,g) ∈ enumerate(groups)
        @inbounds invwisharts[g] += rank1covs[i]
    end
    inv_and_cholesky!(revcholwisharts, cholinvwisharts, invwisharts)
end
function inv_and_cholesky!(riwv::AbstractVector{RevCholWishart{T}},
                           ciwv::AbstractVector{CholInvWishart{T}},
                           iwv::AbstractVector{InverseWishart{T}}) where T
    @inbounds for i ∈ eachindex(iwv)
        iw = iwv[i]
        @fastmath begin
            L11 = sqrt(iw[1])
            R11 = 1 / L11
            L21 = R11 * iw[2]
            L31 = R11 * iw[3]
            L22 = sqrt(iw[4] - L21^2)
            R22 = 1 / L22
            L32 = R22 * (iw[5] - L21 * L31)
            L33 = sqrt(iw[6] - L31^2 - L32^2)
            R33 = 1 / L33
            R21 = - R22 * L21 * R11
            R31 = - R33 * ( L31*R11 + L32*R21 )
            R32 = - R33 * L32 * R22
        end
        ciwv[i] = CholInvWishart(
            L11, L21, L31, L22, L32, L33, iw[7], iw[8]
        )
        riwv[i] = RevCholWishart(
            R11, R21, R31, R22, R32, R33, iw[7], iw[8]
        )
    end
end
```

First, we set the priors for the Inverse-Wisharts. The 8 elements of the `InverseWishart` type are the 6 elements of the lower triangle of the symmetric matrix, the degrees of freedom, and finally $\alpha_g$, the Dirichlet parameter for group membership. Carrying $\alpha_g$ along means we sum up the posterior for the Dirichlet distribution at the same time as we calculate the Wisharts. The actual summation is a simple `for` loop in `calc_Wisharts!`:

```
for (i,g) ∈ enumerate(groups)
    @inbounds invwisharts[g] += rank1covs[i]
end
```

where it adds a rank-1 covariance matrix to the appropriate sum, indexing by group. The rank-1 covariances are the lower triangles of the products $\textbf{y}_i\textbf{y}_i^\top$, plus two ones, to accumulate the posterior degrees of freedom for the Wishart and the Dirichlet posterior.
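Concretely, one such entry for an observation $\textbf{y}_i = (y_1, y_2, y_3)$ could be built like this (a sketch; the library stores these in its own `InverseWishart` layout, which I'm only approximating here):

```julia
# Lower triangle of y*y', column by column, followed by a one for the
# degrees-of-freedom count and a one for the Dirichlet count.
function rank1_entry(y1, y2, y3)
    (y1*y1, y1*y2, y1*y3,  # first column
            y2*y2, y2*y3,  # second column
                   y3*y3,  # third column
     1.0, 1.0)             # ν and α increments
end
```

Summing these tuples over the members of a group yields exactly the sufficient statistics the conjugate update needs.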

We also “unroll” the `for` loop from `set_priors!`:

```
using Base.Cartesian: @nexprs
@macroexpand @nexprs 6 g ->
    Λs[g] = InverseWishart(one(T), zero(T), zero(T), one(T), zero(T), one(T), T(3), T((1/6))*T(1.5)^(6-g))
```

This has the advantage of letting the compiler recognize that each of these is a constant, do the computation only once while compiling, and simply load those results every time.
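Assuming the same formula as in `set_priors!`, the constants baked in for `NG = 6` are easy to check by hand:

```julia
# The Dirichlet prior parameters α_g that become compile-time constants
# for NG = 6: (1/6) * 1.5^(6 - g) for g = 1:6.
αs = [(1/6) * 1.5^(6 - g) for g in 1:6]
```

Note they decrease with `g`: earlier groups get a larger prior weight, and the last group gets exactly `1/6`.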

Finally, I moved the computation of the Cholesky factor and its inverse into the `calc_Wisharts!` function, thinking it's a better fit there.

Because the `Distributions` library can be a little slow, I wrote my own Dirichlet and Gamma random number generators:

```
function dirichlet!(rng::AbstractRNG, probabilities::AbstractVector{T}, α::NTuple{N,T}) where {N,T}
    cumulative_γ = zero(eltype(α))
    @inbounds for i ∈ eachindex(α)
        γ = randgamma(rng, α[i])
        cumulative_γ += γ
        probabilities[i] = γ
    end
    inv_cumulative_γ = 1 / cumulative_γ
    @inbounds for i ∈ eachindex(α)
        probabilities[i] *= inv_cumulative_γ
    end
end
```

where `randgamma` samples from the gamma distribution using an accept-reject method:

```
randgamma(α::Number) = randgamma(Random.GLOBAL_RNG, α)
randgamma(α::Number, β::Number) = randgamma(Random.GLOBAL_RNG, α, β)
randgamma(rng::AbstractRNG, α::Number, β::Number) = randgamma(rng, α) * β
randgamma(rng::AbstractRNG, α::SIMDPirates.AbstractStructVec) = randgamma(rng, SIMDPirates.extract_data(α))
function randgamma(rng::AbstractRNG, α::Vec{W,T}) where {W,T}
    SVec{W,T}(ntuple(w -> (@inbounds randgamma(rng, α[w].value)), Val(W)))
end
@inline function randgamma(rng::AbstractRNG, α::T) where T
    α < one(T) ? exp(-randexp(rng, T)/α) * randgamma_g1(rng, α+one(T)) : randgamma_g1(rng, α)
end
@inline function randgamma_g1(rng::AbstractRNG, α::T) where T
    OneThird = one(T)/T(3)
    d = α - OneThird
    @fastmath c = OneThird / sqrt(d)
    @fastmath while true
        x = randn(rng, T)
        v = one(T) + c*x
        v < zero(T) && continue
        v3 = v^3
        dv3 = d*v3
        randexp(rng, T) > T(-0.5)*x^2 - d + dv3 - d*log(v3) && return dv3
    end
end
```

This `randgamma` is actually much faster than `rand(Gamma(α, β))` from Distributions.jl, which calls the same `C` code used in `R`'s `rgamma`.

Wikipedia discusses Dirichlet sampling using gamma distributions, as well as sampling from Gamma distributions.

Finally, the third component is sampling from the categorical distribution. I broke up the sampling into two parts: first calculating individual likelihoods for each observation, and then sampling.

```
@generated function update_individual_probs!(probabilities::AbstractMatrix{T},
        baseπ::AbstractVector{T}, LiV::AbstractVector{RevCholWishart{T}},
        ν::NTuple{NG,T}, x::AbstractMatrix{T}) where {T,NG}
    quote
        @inbounds for g ∈ 1:$NG
            # L is scaled too large by a factor of √ν; cancels out 1/ν factor on quadratic form in t-pdf
            exponent = T(-0.5) * ν[g] + T(-1.5)
            Li = LiV[g]
            Li11 = Li[1]
            Li21, Li22 = Li[2], Li[4]
            Li31, Li32, Li33 = Li[3], Li[5], Li[6]
            base = log(baseπ[g]) + log(Li11) + log(Li22) + log(Li33) +
                lgamma(-exponent) - lgamma(T(0.5)*ν[g]) - T(1.5)*log(ν[g])
            @vectorize $T for i ∈ 1:size(x, 1)
                lx₁ = Li11*x[i,1]
                lx₂ = Li21*x[i,1] + Li22*x[i,2]
                lx₃ = Li31*x[i,1] + Li32*x[i,2] + Li33*x[i,3]
                probabilities[i,g] = exp(base + exponent * log(one(T) + lx₁*lx₁ + lx₂*lx₂ + lx₃*lx₃))
            end
        end
    end
end
```

We do that by marginalizing out the Inverse-Wisharts, getting trivariate t distributions, and then assessing their likelihoods. We evaluate these likelihoods using the inverses of the Cholesky factors of the Inverse-Wisharts, which were calculated back in step 1 by the `inv_and_cholesky!` function.

The actual code in the CovarianceRealism library is a little more complicated. To maximize performance, we need to minimize the time spent moving values between memory and registers. Many variables are constant across the iterations of the loop, so ideally we would load them only once. However, a CPU has only so many registers; most have 16 per core, while those with avx512 have 32. If it cannot hold everything at once, a few values "spill" and must be stored to memory and reloaded on every iteration of the loop.

When a function like `exp` or `log` isn't inlined, it clobbers all the registers. While the `@vectorize` macro calls versions of these functions that are inlined, they still each use a lot of registers, so it is faster not to do too much else in a `for` loop containing `exp` or `log`.

Furthermore, the `x[i,j]` are the same across all the groups, so given enough registers to hold the different inverse Cholesky factors, it's faster to process a few groups at a time and not reload the `x[i,j]`s.

The resulting function is currently a little ugly, since I haven't automated those transformations.

On splitting up the loops: that means passing over the data several times, loading and storing intermediate results. If there is too much data to fit in a high cache level, that could hurt performance, and you may instead want to increase the amount of computation per loop to give the memory time to make its way to the CPU.

We only store one number per observation on each pass. Because we're working in single precision, the sample size just needs to be less than about 8000 for everything to fit in the fastest cache level. The sample sizes of all our data sets are smaller than this, so we are safe.
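That threshold follows from a typical 32 KiB L1 data cache (the cache size here is an assumption about the CPU, not something the code queries):

```julia
# How many single-precision numbers fit in a 32 KiB L1 data cache?
l1_bytes = 32 * 1024
capacity = l1_bytes ÷ sizeof(Float32)  # 8192 Float32s
```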

`@generated` functions let you "meta-program": write code that writes code. The cases I'm showing here are hopefully relatively straightforward; all I'm doing is inserting some symbols or types into the code.

The group-sampler, conditioned on these likelihoods:

```
@generated function Random.rand!(rng::VectorizedRNG.PCG{M}, group::Groups{NG}, probabilities::AbstractMatrix{T}) where {M,NG,T}
    W = LoopVectorization.pick_vector_width(T)
    Vi = Vec{W,Int8}
    Vf = Vec{W,T}
    p_NG = Symbol(:p_, NG)
    vp_NG = Symbol(:vp_, NG)
    vg_NG = Symbol(:vg_, NG)
    quote
        @nexprs $NG g -> vg_g = vbroadcast($Vi, g)
        N = length(group)
        pg = pointer(group)
        for n ∈ 0:$(M*W):$(loop_max_expr(M, W, :N))
            vufull = rand(rng, Vec{$(M*W),$T})
            Base.Cartesian.@nexprs $M u -> begin
                vp_1 = vload($Vf, probabilities, n + (u-1)*$W + 1)
                @nexprs $(NG-1) g -> vp_{g+1} = SIMDPirates.vadd(vp_g, vload($Vf, probabilities, (n+(u-1)*$W,g+1)))
                vu = SIMDPirates.vmul((Base.Cartesian.@ntuple $W w -> @inbounds vufull[w + (u-1)*$W]), $vp_NG)
                vg = $vg_NG
                @nexprs $(NG-1) g -> vg = vifelse( SIMDPirates.vless(vu, vp_{$NG-g}), vg_{$NG-g}, vg )
                vstore(vg, pg + n + (u-1)*$W)
            end
        end
        # back-tracking loop. This approach imposes a minimum sample size limit.
        # for n ∈ (N - (N & $W) - ((N & $(W-1)) > 0) * $W):$W:N-$W
        # Because we preallocate extra space, we can instead extend past the end without
        # segfaulting. This however forces us to initialize probs data as zeros, to avoid subnormals.
        for n ∈ $(remloop_min_expr(M, W, :N)):$W:N-1
            vp_1 = vload($Vf, probabilities, n)
            @nexprs $(NG-1) g -> vp_{g+1} = SIMDPirates.vadd(vp_g, vload($Vf, probabilities, (n,g+1)))
            vu = SIMDPirates.vmul(rand(rng, Vec{$W,$T}), $vp_NG)
            vg = $vg_NG
            @nexprs $(NG-1) g -> vg = vifelse( SIMDPirates.vless(vu, vp_{$NG-g}), vg_{$NG-g}, vg )
            vstore(vg, pg + n)
        end
    end
end
```

This one is the ugliest, because the easiest way I found to write it was as a generated function that produces the correct code depending on the types of the inputs and the CPU of the computer it runs on.

Let’s not dwell on the implementation details at the moment.

The basic approach is to generate random uniform samples, calculate the discrete CDF from the likelihoods, and compare it to the uniform samples to pick the groups. A simple optimization is to, instead of normalizing the CDF to equal one via dividing by the final sum, multiply the uniform probabilities by the sum instead. Multiplication is much faster than division.
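In scalar form, the trick looks like this sketch (my own illustration, not the vectorized code above):

```julia
# Sample a category index from unnormalized weights by scaling the
# uniform draw up by the total, instead of dividing every CDF entry.
function sample_category(weights)
    total = sum(weights)
    u = rand() * total  # one multiply instead of length(weights) divides
    acc = zero(total)
    for (g, w) in enumerate(weights)
        acc += w
        u < acc && return g  # first CDF entry exceeding u wins
    end
    return length(weights)  # guard against floating-point round-off
end
```

The vectorized version does the same comparison lane-by-lane with `vifelse`, selecting the group index for a whole SIMD vector of observations at once.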

All these operations can of course easily be vectorized. I do that explicitly in the code above, in a manner similar to what the `@vectorize` macro does.

Okay, let's try running this and see how we do, loading the same fake dataset we used in the earlier `R` and basic-Julia demos:

```
using DelimitedFiles, CovarianceRealism, BenchmarkTools, StaticArrays, Random
BigPropPoints = readdlm(joinpath("/home/chriselrod/Documents/progwork/R", "BigPropPoints.csv"), ',', Float64)
BPP32 = Float32.(BigPropPoints')
num_groups = 6
N = size(BPP32, 1); mcmc_iter = 10^4
workingdata = CovarianceRealism.WorkingData(N, Float32, Val(num_groups))
mcmcres = CovarianceRealism.MCMCResult(num_groups, mcmc_iter, Float32)
Y = Matrix{Float32}(undef, N, 3)
CovarianceRealism.process_big_prop_points!(Y, BPP32)
rank1covs = Vector{InverseWishart{Float32}}(undef, N);
CovarianceRealism.generate_rank1covariances!(rank1covs, Y)
@generated function base_π(::Val{N}, ::Type{T}) where {N,T}
    SVector(ntuple(Val(N)) do n T(1.5^(N-n)) end) |> x -> x / sum(x)
end
baseπ = base_π(Val(num_groups), Float32)
invwisharts, individual_probs, groups =
    workingdata.inverse_wisharts, workingdata.individual_probs, workingdata.groups;
probs, revcholwisharts, cholinvwisharts =
    mcmcres.Probs, mcmcres.RevCholWisharts, mcmcres.CholInvWisharts;
scalar_rng = CovarianceRealism.GLOBAL_PCG.scalar
vector_rng = CovarianceRealism.GLOBAL_PCG.vector
rcw = @views revcholwisharts[:,1];
ciw = @views cholinvwisharts[:,1];
pcj = @views probs[:,1];
@benchmark rand!($vector_rng, $groups, $baseπ)
```

Now that we have picked random groups from the initial prior, we can benchmark the components of our Gibbs sampler:

```
@benchmark CovarianceRealism.calc_Wisharts!($rcw, $ciw, $invwisharts, $groups, $rank1covs)
```

```
eα = CovarianceRealism.extract_α(invwisharts, Val(num_groups));
@benchmark CovarianceRealism.randdirichlet!($scalar_rng, $pcj, $eα)
```

```
eν = CovarianceRealism.extract_ν(invwisharts, Val(num_groups));
@benchmark CovarianceRealism.update_individual_probs!($individual_probs, $pcj, $rcw, $eν, $Y)
```

```
length(individual_probs), size(individual_probs), 1024 * 6
```

```
@benchmark CovarianceRealism.rand!($vector_rng, $groups, $individual_probs)
```

Subjectively, those times look fast. So, let's run the entire sampler on a single core for the same number of iterations and warmup we used for JAGS, and time it.

```
@benchmark CovarianceRealism.singlechain_sample!($mcmcres, $Y, $rank1covs, $workingdata, $BPP32, $baseπ, $mcmc_iter, 4000)
```

That is 80 ms. JAGS took 138 seconds.

```
# seconds * ms/second / ms
138 * 1000 / 80
```

We are now roughly 1700 times faster than JAGS. Sweet!

```
# seconds * ms/second / ms
2.8 * 1000 / 80
```

and about 35 times faster than our simple, unoptimized Julia code.

Additionally, I also wrote a C++ version of this Gibbs model to serve as a basis for comparison (the master branch requires avx512, while the `xsimd` branch only requires avx). I wanted to optimize it for the same platform to get a fair comparison, but I also did not want to spend the time to make it appropriately generic (able to run on most computers) like I did for the optimized Julia code.

The only `C++` feature I used that doesn't have an equivalent in `C` is templates. I used templates to make it easy to compile multiple separate versions of the sampler for different numbers of mixture components, although at the moment I've only compiled it with 6, to compare with the JAGS and Julia code.

I support both the Clang and GNU `C++` compilers. Calling `C` from Julia is straightforward:

```
const CLANGBUILDDIR = "/home/chriselrod/Documents/progwork/Cxx/ClangCovarBuildDir";
const libgibbcovarrealismclang = joinpath(CLANGBUILDDIR,"libgibbcovarrealism.so");
function sample_6clang!(probs, revcholwisharts, cholinvwisharts, X, BPP, base_p, N, warmup, iter)
    ccall((:sample_6groups, libgibbcovarrealismclang), Cvoid,
        (Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Int,Int,Int),
        probs, revcholwisharts, cholinvwisharts, X, BPP, base_p, N, warmup, iter
    )
end
probs = mcmcres.Probs;
revcholwisharts = mcmcres.RevCholWisharts;
cholinvwisharts = mcmcres.CholInvWisharts;
base_p = Array(baseπ);
sample_6clang!(probs, revcholwisharts, cholinvwisharts, Y, BPP32, base_p, N, 4000, mcmc_iter)
```

```
@benchmark sample_6clang!($probs, $revcholwisharts, $cholinvwisharts, $Y, $BPP32, $base_p, $N, 4000, $mcmc_iter)
```

```
const GNUBUILDDIR = "/home/chriselrod/Documents/progwork/Cxx/gccCovarBuildDir"
const libgibbcovarrealismgnu = joinpath(GNUBUILDDIR,"libgibbcovarrealism.so");
function sample_6gnu!(probs, revcholwisharts, cholinvwisharts, X, BPP, base_p, N, warmup, iter)
    ccall((:sample_6groups, libgibbcovarrealismgnu), Cvoid,
        (Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Int,Int,Int),
        probs, revcholwisharts, cholinvwisharts, X, BPP, base_p, N, warmup, iter
    )
end
@benchmark sample_6gnu!($probs, $revcholwisharts, $cholinvwisharts, $Y, $BPP32, $base_p, $N, 4000, $mcmc_iter)
```

```
138 * 1000 / 66.9
```

```
138 * 1000 / 76.7
```

Clang was around 2000 times faster than JAGS (I realized later that the best JAGS run, the one quoted in the earlier post, was actually 131 seconds, not 138)! That is the difference between a simulation running overnight versus taking two years!

Or the time it takes to take a sip of coffee, versus 3 hours.