OptimizedGibbs








Now, let’s revisit the Gibbs sampler, and see how much faster we can get it.

My Julia code depends on a lot of packages that I’ve written or adapted from others. They also assume they can find each other in Julia’s package-development folder, so you have to install them by saying you wish to develop them. From the Julia REPL:

] dev https://github.com/chriselrod/VectorizationBase.jl https://github.com/chriselrod/SIMDPirates.jl https://github.com/chriselrod/SLEEF.jl https://github.com/chriselrod/VectorizedRNG.jl https://github.com/chriselrod/ScatteredArrays.jl
; cd ~/.julia/dev/SLEEF
; git fetch origin
; git checkout -b SIMDPirates remotes/origin/SIMDPirates
] dev https://github.com/chriselrod/DifferentiableObjects.jl https://github.com/chriselrod/KernelDensityDistributionEsimates.jl https://github.com/chriselrod/LoopVectorization.jl https://github.com/chriselrod/CovarianceRealism.jl

On Windows, it is probably ; cd ~\.julia\dev\SLEEF to navigate to that folder.

A leading ] means to enter Julia’s package mode, which you do by hitting the ] key. Similarly, a leading ; means to enter the command-line (shell) mode, which you enter by hitting ;.
I could write a fair bit about most of these libraries. (I should certainly write documentation for all of them…)

I’ll derail myself briefly just to talk about VectorizedRNG.
For comparison, RcppZiggurat discusses a variety of methods of generating random normals. The Ziggurat method is the second fastest, but unlike the fastest, has great statistical properties. Julia’s default randn() uses the Ziggurat method, as of course does the R library RcppZiggurat.

The method folks are likely most familiar with is the well-known Box-Muller transform, discussed, for example, in Casella & Berger’s Statistical Inference. Poor old Box-Muller did not fare well there:

“We see that the Box-Muller generator is the slowest by some margin”

It isn’t exactly clear how wide that margin is, but from their figures Box-Muller took over 300 ms for a million draws on their test computer, while many of the Ziggurat implementations took well under 50 ms.
We can easily run R scripts from Julia, without even needing to load the RCall library, by taking advantage of the run function, which runs shell commands (alternatively, you could hit the semicolon ; at the Julia REPL to enter the command-line mode). For example:

In [1]:
run(`Rscript -e "library(microbenchmark); microbenchmark(rnorm(1e6))"`)
Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 rnorm(1e+06) 46.41033 46.47433 46.90736 46.52105 46.55983 49.50337   100
Out[1]:
Process(`Rscript -e 'library(microbenchmark); microbenchmark(rnorm(1e6))'`, ProcessExited(0))
In [2]:
run(`Rscript -e "library(microbenchmark); RNGkind(normal.kind = 'Box-Muller'); microbenchmark(rnorm(1e6))"`)
Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval
 rnorm(1e+06) 42.99451 43.22834 43.62269 43.25788 43.28881 46.41808   100
Out[2]:
Process(`Rscript -e "library(microbenchmark); RNGkind(normal.kind = 'Box-Muller'); microbenchmark(rnorm(1e6))"`, ProcessExited(0))
In [3]:
run(`Rscript -e "library(RcppZiggurat); library(microbenchmark); microbenchmark(zrnorm(1e6))"`)
Unit: milliseconds
          expr      min       lq     mean   median       uq      max neval
 zrnorm(1e+06) 6.580412 6.617218 7.012221 6.632097 6.646613 9.511995   100
Out[3]:
Process(`Rscript -e 'library(RcppZiggurat); library(microbenchmark); microbenchmark(zrnorm(1e6))'`, ProcessExited(0))

That library certainly is much faster than Box-Muller and R’s default (which, being about as fast as Box-Muller, probably is Box-Muller). Also, recall what I wrote earlier about memory bottlenecks for big outputs: I want to benchmark the random number generator, not my computer’s memory bandwidth.
In code using the RNGs, you can generate a smaller number of random numbers at a time, processing them in batches, to avoid getting throttled by slow memory. So let’s try a vector of length 2000:

In [4]:
run(`Rscript -e "library(RcppZiggurat); library(microbenchmark); microbenchmark(zrnorm(2e3))"`)
Unit: microseconds
         expr    min      lq     mean  median    uq    max neval
 zrnorm(2000) 10.448 10.9475 11.99211 11.1805 11.45 80.249   100
Out[4]:
Process(`Rscript -e 'library(RcppZiggurat); library(microbenchmark); microbenchmark(zrnorm(2e3))'`, ProcessExited(0))

Ah, that is a little faster: about 10 microseconds for 2000 draws, so 500 such batches (a million numbers) would take a little over 5 milliseconds, beating the 6.6 milliseconds of the single million-draw call.
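To make the batching idea concrete, here is a rough sketch (not code from any of the libraries discussed here) of consuming random numbers in small, cache-resident chunks instead of materializing one huge vector:

using Random

function batched_sum_of_squares(n, batchsize = 2_000)
    buf = Vector{Float64}(undef, batchsize)
    total = 0.0
    remaining = n
    while remaining > 0
        m = min(batchsize, remaining)
        chunk = view(buf, 1:m)
        randn!(chunk)               # draw a batch small enough to stay in cache
        total += sum(abs2, chunk)   # ...and consume it immediately
        remaining -= m
    end
    total
end

batched_sum_of_squares(10^6)  # ≈ 10^6, since E[Z²] = 1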

So, let’s try the random number generator I wrote in Julia.
And, for reference, which method did I choose, wanting a fast algorithm? Box-Muller! Uh, not off to a good start.

In [5]:
using VectorizedRNG, BenchmarkTools, Random
x = Vector{Float64}(undef, 2000)

@benchmark randn!(VectorizedRNG.GLOBAL_vPCG, $x)
Out[5]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.731 μs (0.00% GC)
  median time:      2.742 μs (0.00% GC)
  mean time:        2.747 μs (0.00% GC)
  maximum time:     5.103 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

Box-Muller strikes back with a vengeance, over 3 times faster instead of 4 times slower!

A quick disclaimer: optimal performance requires the avx512f and avx512dq instruction sets. These make vectorized code much faster. Unfortunately, not many CPUs have them yet, but they will hopefully start becoming mainstream later this year.

If you’d like to check your CPU flags on a Unix-like system, try running something like:

In [6]:
run(`grep -m 1 avx512dq /proc/cpuinfo`)
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d
Out[6]:
Process(`grep -m 1 avx512dq /proc/cpuinfo`, ProcessExited(0))

This prints the entire flags line. If it hadn’t contained avx512dq, grep would have found no match and printed nothing at all.
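If you’d rather do the check from within Julia (still Linux-only, since this also parses /proc/cpuinfo), something like the following sketch works; it isn’t part of any of the packages above:

# Return true if the first `flags` line of /proc/cpuinfo contains `flag`.
function has_cpu_flag(flag::AbstractString)
    info = read("/proc/cpuinfo", String)
    m = match(r"^flags\s*:\s*(.*)$"m, info)
    m === nothing ? false : flag in split(m.captures[1])
end

has_cpu_flag("avx512dq")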

We can also benchmark versus Julia’s Ziggurat method, for good measure:

In [7]:
@benchmark randn!($x)
Out[7]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.644 μs (0.00% GC)
  median time:      7.996 μs (0.00% GC)
  mean time:        8.010 μs (0.00% GC)
  maximum time:     12.050 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

We can break the speed differences down into the underlying uniform random number generator, and the transformation of uniforms into normals.
RcppZiggurat uses the fast KISS generator, which fails the linear-complexity tests of the TestU01 BigCrush suite.
VectorizedRNG uses PCGs, which are statistically robust (they pass) and very efficient with avx512f.
Julia’s default uses a MersenneTwister, which is fairly robust and, when vectorized (as Julia’s is), very fast at generating uniforms:

In [8]:
@benchmark rand!($x)
Out[8]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     909.385 ns (0.00% GC)
  median time:      914.744 ns (0.00% GC)
  mean time:        918.660 ns (0.00% GC)
  maximum time:     1.407 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     39

The default rand cannot be vectorized when called inside a for loop, and it also isn’t vectorized while generating random normals. So let’s use a for loop to get a baseline for the amount of time it takes to generate 2000 random uniforms one at a time, the way they are drawn while sampling from a normal distribution:

In [9]:
function loop_rand!(x)
    @inbounds for i ∈ eachindex(x)
        x[i] = rand()
    end
end
@benchmark loop_rand!($x)
Out[9]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.155 μs (0.00% GC)
  median time:      2.260 μs (0.00% GC)
  mean time:        2.388 μs (0.00% GC)
  maximum time:     4.118 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9
In [10]:
@benchmark rand!(VectorizedRNG.GLOBAL_vPCG, $x)
Out[10]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     540.420 ns (0.00% GC)
  median time:      545.399 ns (0.00% GC)
  mean time:        544.692 ns (0.00% GC)
  maximum time:     853.527 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     188

Most of the runtime is clearly spent generating the normals from uniforms, and there Box-Muller is much faster: about 2.3 microseconds to turn uniforms into normals, while the difference between Ziggurat-normals and uniforms in base Julia is over twice that.

I’d expect the case to be similar for Rcpp, which was just a little slower than Julia’s Ziggurat.
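Since Box-Muller keeps coming up, here is the textbook scalar version of the transform, just to show why it vectorizes so well: every iteration does the same log/sqrt/sincos work, with no branches or table lookups as in the Ziggurat method. This is only an illustrative sketch, not the VectorizedRNG implementation:

# Turn a vector of uniforms into a vector of standard normals, two at a time.
function boxmuller!(z::AbstractVector{T}, u::AbstractVector{T}) where {T}
    @assert iseven(length(z)) && length(z) == length(u)
    @inbounds for i ∈ 1:2:length(z)
        r = sqrt(-2 * log(u[i]))          # radius from the first uniform
        s, c = sincos(2 * T(π) * u[i+1])  # sine and cosine of the angle from the second
        z[i]   = r * c
        z[i+1] = r * s
    end
    z
end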

Of course, if we’re sampling random numbers, it helps to have fast random number generators.
Digression aside, let’s talk Gibbs sampling!

A refresher on the three conditional samplers:

We can split up our parameters into three blocks (a sketch of one full iteration follows the list):

  1. The (Inverse-)Wisharts $(\boldsymbol{\Sigma})\ \boldsymbol{\Lambda}$. Conditioning on the group membership of each $\textbf{y}_n$, the (Inverse-)Wisharts are conjugate with the normal distribution on $\textbf{y}_n$.
  2. The proportion belonging to each group, $\boldsymbol{\pi}$. Conditioning on the number of observations in each group, the Dirichlet-categorical distribution is conjugate.
  3. Group memberships. Conditioning on the proportions and Inverse-Wisharts of each group, we can calculate the individual likelihood of each $\textbf{y}$ belonging to each group. Categorical distributions are straightforward to sample from.
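Putting these together, one full Gibbs iteration looks roughly like the following sketch. It uses the CovarianceRealism functions benchmarked later in this post, but simplifies the argument plumbing relative to the real driver, CovarianceRealism.singlechain_sample!:

using CovarianceRealism

# Sketch of a single Gibbs iteration; not the actual implementation.
function gibbs_iteration!(scalar_rng, vector_rng, groups, invwisharts,
                          revcholwisharts, cholinvwisharts, individual_probs,
                          probs, rank1covs, Y, ::Val{NG}) where {NG}
    # 1. Inverse-Wisharts, conditioning on the current group memberships
    CovarianceRealism.calc_Wisharts!(revcholwisharts, cholinvwisharts, invwisharts, groups, rank1covs)
    # 2. Group proportions, conditioning on group counts (Dirichlet-categorical conjugacy)
    CovarianceRealism.randdirichlet!(scalar_rng, probs, CovarianceRealism.extract_α(invwisharts, Val(NG)))
    # 3. Group memberships, conditioning on the proportions and the Wisharts
    CovarianceRealism.update_individual_probs!(individual_probs, probs, revcholwisharts,
                                               CovarianceRealism.extract_ν(invwisharts, Val(NG)), Y)
    CovarianceRealism.rand!(vector_rng, groups, individual_probs)
end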

Let’s look at the code for each of these in CovarianceRealism.jl.
To update the Inverse-Wisharts based on group membership, we have

@generated function set_priors!(Λs::AbstractVector{InverseWishart{T}}, ::Val{NG}) where {NG, T}
    quote
        $(Expr(:meta, :inline))
        # Identity prior scale matrix, 3 degrees of freedom, and a decreasing Dirichlet α across groups.
        @inbounds @nexprs $NG g ->
            Λs[g] = InverseWishart(one(T), zero(T), zero(T), one(T), zero(T), one(T), T(3), T($(1/NG))*T(1.5)^($NG-g))
        nothing
    end
end
function calc_Wisharts!(revcholwisharts::AbstractVector{RevCholWishart{T}},
                        cholinvwisharts::AbstractVector{CholInvWishart{T}},
                        invwisharts::AbstractVector{InverseWishart{T}},
                        groups::Groups{NG}, rank1covs::Vector{InverseWishart{T}}) where {T,NG}
    set_priors!(invwisharts, Val(NG))

    for (i,g) ∈ enumerate(groups)
        @inbounds invwisharts[g] += rank1covs[i]
    end

    inv_and_cholesky!(revcholwisharts, cholinvwisharts, invwisharts)
end
function inv_and_cholesky!(riwv::AbstractVector{RevCholWishart{T}},
                                    ciwv::AbstractVector{CholInvWishart{T}},
                                    iwv::AbstractVector{InverseWishart{T}}) where T

    @inbounds for i ∈ eachindex(iwv)
        iw = iwv[i]
        @fastmath begin
            L11 = sqrt(iw[1])
            R11 = 1 / L11
            L21 = R11 * iw[2]
            L31 = R11 * iw[3]
            L22 = sqrt(iw[4] - L21^2)
            R22 = 1 / L22
            L32 = R22 * (iw[5] - L21 * L31)
            L33 = sqrt(iw[6] - L31^2 - L32^2)
            R33 = 1 / L33

            R21 = - R22 * L21 * R11
            R31 = - R33 * ( L31*R11 + L32*R21 )
            R32 = - R33 * L32 * R22
        end
        ciwv[i] = CholInvWishart(
            L11, L21, L31, L22, L32, L33, iw[7], iw[8]
        )
        riwv[i] = RevCholWishart(
            R11, R21, R31, R22, R32, R33, iw[7], iw[8]
        )
    end
end

First, we set the priors for the Inverse-Wisharts. The 8 elements of the InverseWishart type are the 6 elements of the lower triangle of the symmetric matrix, the degrees of freedom, and finally $\alpha_g$, the Dirichlet parameter for group membership. Carrying $\alpha_g$ along lets us accumulate the posterior for the Dirichlet distribution at the same time as we calculate the Wisharts. The actual summation is a simple for loop in the function calc_Wisharts!:

for (i,g) ∈ enumerate(groups)
    @inbounds invwisharts[g] += rank1covs[i]
end

where it adds a rank-1 covariance matrix to the appropriate sum, indexing by group. The rank-1 covariances are the lower triangles of the products $\textbf{y}_i\textbf{y}_i^T$, plus two ones that update the posterior degrees of freedom for the Wishart and the Dirichlet count.
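For illustration, building one of these rank-1 entries might look like this; the field order follows the 8-element layout described above, though the actual constructor in CovarianceRealism.jl may differ in detail:

# Hypothetical sketch: lower triangle of y*y' in column-major order, then the two counters.
function rank1cov(y::AbstractVector{T}) where {T}
    InverseWishart(
        y[1]*y[1], y[2]*y[1], y[3]*y[1],  # first column of the lower triangle
        y[2]*y[2], y[3]*y[2],             # second column
        y[3]*y[3],                        # third column
        one(T),                           # adds 1 to the posterior degrees of freedom
        one(T)                            # adds 1 to the Dirichlet α of the chosen group
    )
end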

We also “unroll” the for loop from set_priors!:

In [11]:
using Base.Cartesian: @nexprs
@macroexpand @nexprs 6 g ->
            Λs[g] = InverseWishart(one(T), zero(T), zero(T), one(T), zero(T), one(T), T(3), T((1/6))*T(1.5)^(6-g))
Out[11]:
quote
    Λs[1] = InverseWishart(one(T), zero(T), zero(T), one(T), zero(T), one(T), T(3), T(0.16666666666666666) * T(1.5) ^ 5)
    Λs[2] = InverseWishart(one(T), zero(T), zero(T), one(T), zero(T), one(T), T(3), T(0.16666666666666666) * T(1.5) ^ 4)
    Λs[3] = InverseWishart(one(T), zero(T), zero(T), one(T), zero(T), one(T), T(3), T(0.16666666666666666) * T(1.5) ^ 3)
    Λs[4] = InverseWishart(one(T), zero(T), zero(T), one(T), zero(T), one(T), T(3), T(0.16666666666666666) * T(1.5) ^ 2)
    Λs[5] = InverseWishart(one(T), zero(T), zero(T), one(T), zero(T), one(T), T(3), T(0.16666666666666666) * T(1.5) ^ 1)
    Λs[6] = InverseWishart(one(T), zero(T), zero(T), one(T), zero(T), one(T), T(3), T(0.16666666666666666) * T(1.5) ^ 0)
end

This has the advantage of letting the compiler recognize that every one of these arguments is a constant, do the computation only once (at compile time), and simply load the results every time the function runs.

Finally, I moved the computation of the Cholesky factor and its inverse into the calc_Wisharts! function, where it seems a better fit.

Because the Distributions library can be a little slow, I wrote my own Dirichlet and Gamma random number generators:

function dirichlet!(rng::AbstractRNG, probabilities::AbstractVector{T}, α::NTuple{N,T}) where {N,T}
    cumulative_γ = zero(eltype(α))
    @inbounds for i ∈ eachindex(α)
        γ = randgamma(rng, α[i])
        cumulative_γ += γ
        probabilities[i] = γ
    end
    inv_cumulative_γ = 1 / cumulative_γ
    @inbounds for i ∈ eachindex(α)
        probabilities[i] *= inv_cumulative_γ
    end
end

where randgamma samples from the gamma distribution, using the accept-reject method:

randgamma(α::Number) = randgamma(Random.GLOBAL_RNG, α)
randgamma(α::Number, β::Number) = randgamma(Random.GLOBAL_RNG, α, β)
randgamma(rng::AbstractRNG, α::Number, β::Number) = randgamma(rng, α) * β
randgamma(rng::AbstractRNG, α::SIMDPirates.AbstractStructVec) = randgamma(rng, SIMDPirates.extract_data(α))
function randgamma(rng::AbstractRNG, α::Vec{W,T}) where {W,T}
    SVec{W,T}(ntuple(w -> (@inbounds randgamma(rng, α[w].value)), Val(W)))
end
@inline function randgamma(rng::AbstractRNG, α::T) where T
    α < one(T) ? exp(-randexp(rng, T)/α) * randgamma_g1(rng, α+one(T)) : randgamma_g1(rng, α)
end
@inline function randgamma_g1(rng::AbstractRNG, α::T) where T
    OneThird = one(T)/T(3)
    d = α - OneThird
    @fastmath c = OneThird / sqrt(d)
    @fastmath while true
        x = randn(rng, T)
        v = one(T) + c*x
        v < zero(T) && continue
        v3 = v^3
        dv3 = d*v3
        randexp(rng, T) > T(-0.5)*x^2 - d + dv3 - d*log(v3) && return dv3
    end
end

This randgamma is actually much faster than rand(Gamma(α, β)) from Distributions.jl, which calls the same C code used in R‘s rgamma.

Wikipedia discusses Dirichlet sampling using gamma distributions, as well as sampling from Gamma distributions.
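As a quick sanity check of the dirichlet! function above (assuming the scalar randgamma methods have been defined), the sampled probabilities should be non-negative and sum to one:

using Random

probs4 = Vector{Float64}(undef, 4)
dirichlet!(Random.GLOBAL_RNG, probs4, (2.0, 3.0, 1.0, 0.5))
sum(probs4) ≈ 1.0   # true: each entry is a Gamma(αᵢ) draw divided by the total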

Finally, the third component is sampling from the categorical distribution. I broke up the sampling into two parts: first calculating individual likelihoods for each observation, and then sampling.

@generated function update_individual_probs!(probabilities::AbstractMatrix{T},
                                baseπ::AbstractVector{T}, LiV::AbstractVector{RevCholWishart{T}},
                                ν::NTuple{NG,T}, x::AbstractMatrix{T}) where {T,NG}
    quote
        @inbounds for g ∈ 1:$NG
            # L is scaled too large by a factor of √ν; cancels out 1/ν factor on quadratic form in t-pdf
            exponent = T(-0.5) * ν[g] + T(-1.5)
            Li = LiV[g]
            Li11             = Li[1]
            Li21, Li22       = Li[2], Li[4]
            Li31, Li32, Li33 = Li[3], Li[5], Li[6]

            base = log(baseπ[g]) + log(Li11) + log(Li22) + log(Li33) +
                    lgamma(-exponent) - lgamma(T(0.5)*ν[g]) - T(1.5)*log(ν[g])

            @vectorize $T for i ∈ 1:size(x, 1)
                lx₁ = Li11*x[i,1]
                lx₂ = Li21*x[i,1] + Li22*x[i,2]
                lx₃ = Li31*x[i,1] + Li32*x[i,2] + Li33*x[i,3]
                probabilities[i,g] = exp(base + exponent * log(one(T) + lx₁*lx₁ + lx₂*lx₂ + lx₃*lx₃))
            end
        end
    end
end

We do this by marginalizing out the Inverse-Wisharts, which yields trivariate t distributions, and then evaluating their likelihoods. We evaluate each likelihood using the inverse of the Cholesky factor of the Inverse-Wishart, which was calculated back in block 1 by the inv_and_cholesky! function.
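Written out (and dropping additive terms that are constant across groups, since the weights only need to be correct up to proportionality), the unnormalized log-weight that the inner loop exponentiates into probabilities[i,g] is

$$\log p_{i,g} = \log \pi_g + \log\left|\textbf{L}_g^{-1}\right| + \log\Gamma\left(\frac{\nu_g+3}{2}\right) - \log\Gamma\left(\frac{\nu_g}{2}\right) - \frac{3}{2}\log\nu_g - \frac{\nu_g+3}{2}\log\left(1 + \left\|\textbf{L}_g^{-1}\textbf{y}_i\right\|^2\right),$$

where $\textbf{L}_g^{-1}$ is the inverse Cholesky factor stored in the RevCholWishart. As the comment in the code notes, that factor is already scaled by $\sqrt{\nu_g}$, which is why no explicit $1/\nu_g$ appears inside the quadratic form.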

The actual code in the CovarianceRealism library is a little more complicated. To maximize performance, we need to minimize the amount of time spent moving data in and out of registers. Many variables are constant across the iterations of the loop, so ideally we would load them only once. However, a CPU only has so many registers; most x86 CPUs have 16 vector registers per core, while those with avx512 have 32. If everything cannot be held at once, a few values “spill”, and need to be stored to memory and reloaded on every iteration of the loop.
When a function like exp or log isn’t inlined, calling it clobbers all the registers. While the @vectorize macro calls versions of these functions that are inlined, those functions still each use a lot of registers, so it is faster not to do too much else in a for loop containing exp or log.
Furthermore, the x[i,j] are the same across all the groups, so given enough registers to hold the different inverse Cholesky matrices, it is faster to process a few groups at a time and not reload the x[i,j]s.
The resulting function is currently a little ugly, since I haven’t automated those transformations.
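As a toy illustration of that point about exp (not the actual CovarianceRealism code), splitting the register-hungry call into its own pass can look like this:

# Pass 1: cheap arithmetic only, so the loop constants stay in registers.
# Pass 2: the exp call, which needs many registers of its own.
function two_pass_exp!(out, a, b)
    @inbounds for i ∈ eachindex(out)
        out[i] = muladd(a[i], b[i], one(eltype(out)))
    end
    @inbounds for i ∈ eachindex(out)
        out[i] = exp(out[i])
    end
    out
end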

On splitting up the loops: doing so means you pass over the data several times, loading and storing intermediate results. If there is too much data to fit in a high cache level, that can hurt performance, and you may instead want to increase the amount of computation per loop to give the memory time to make its way to the CPU.
Here we only store one intermediate number per observation at a time. Because we’re working in single precision, the sample size just needs to be less than about 8000 for everything to fit in the fastest cache level.
The sample sizes of all our data sets are smaller than this, so we are safe.
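The arithmetic behind that 8000 figure, assuming a typical 32 KiB L1 data cache per core (the exact size is CPU-dependent):

l1d_bytes = 32 * 1024           # typical L1 data cache size, in bytes
l1d_bytes ÷ sizeof(Float32)     # 8192 single-precision numbers fit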

@generated functions let you “meta-program”: write code that writes code. The cases I’m showing here are hopefully relatively straightforward; all I’m doing is inserting some symbols or types into code.
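If you haven’t run into @generated functions before, here is a toy example, unrelated to the sampler, of the general idea: the body runs at compile time on the argument types and returns the expression that will actually be executed.

# Builds the expression t[1] + t[2] + ... + t[N] at compile time,
# so the runtime code is a fully unrolled sum with no loop.
@generated function tuple_sum(t::NTuple{N,T}) where {N,T}
    ex = :(t[1])
    for i ∈ 2:N
        ex = :($ex + t[$i])
    end
    ex
end

tuple_sum((1.0, 2.0, 3.0))  # 6.0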

The group-sampler, conditioned on these likelihoods:

@generated function Random.rand!(rng::VectorizedRNG.PCG{M}, group::Groups{NG}, probabilities::AbstractMatrix{T}) where {M,NG,T}
    W = LoopVectorization.pick_vector_width(T)
    Vi = Vec{W,Int8}
    Vf = Vec{W,T}
    p_NG = Symbol(:p_, NG)
    vp_NG = Symbol(:vp_, NG)
    vg_NG = Symbol(:vg_, NG)
    quote
        @nexprs $NG g -> vg_g = vbroadcast($Vi, g)
        N = length(group)

        pg = pointer(group)
        for n ∈ 0:$(M*W):$(loop_max_expr(M, W, :N))
            vufull = rand(rng, Vec{$(M*W),$T})
            Base.Cartesian.@nexprs $M u -> begin
                vp_1 = vload($Vf, probabilities, n + (u-1)*$W + 1)
                @nexprs $(NG-1) g -> vp_{g+1} = SIMDPirates.vadd(vp_g, vload($Vf, probabilities, (n+(u-1)*$W,g+1)))
                vu = SIMDPirates.vmul((Base.Cartesian.@ntuple $W w -> @inbounds vufull[w + (u-1)*$W]), $vp_NG)
                vg = $vg_NG
                @nexprs $(NG-1) g -> vg = vifelse( SIMDPirates.vless(vu, vp_{$NG-g}), vg_{$NG-g}, vg )
                vstore(vg, pg + n + (u-1)*$W)
            end
        end
        # back tracking loop. This approach imposes a minimum sample size limit.
        # for n ∈ (N - (N & $W) - ((N & $(W-1)) > 0) * $W):$W:N-$W
        # because we preallocate extra space, we can instead extend past the end without
        # segfaulting. This however forces us to initialize probs data as zeros, to avoid subnormals.
        for n ∈ $(remloop_min_expr(M, W, :N)):$W:N-1
            vp_1 = vload($Vf, probabilities, n)
            @nexprs $(NG-1) g -> vp_{g+1} = SIMDPirates.vadd(vp_g, vload($Vf, probabilities, (n,g+1)))
            vu = SIMDPirates.vmul(rand(rng, Vec{$W,$T}), $vp_NG)
            vg = $vg_NG
            @nexprs $(NG-1) g -> vg = vifelse( SIMDPirates.vless(vu, vp_{$NG-g}), vg_{$NG-g}, vg )
            vstore(vg, pg + n)
        end
    end
end

This one is the ugliest, because the easiest way I found to write it was as a generated function that produces the correct code depending on the types of the inputs and the CPU of the computer it is running on.
Let’s not dwell on the implementation details at the moment.

The basic approach is to generate uniform random samples, calculate the discrete (unnormalized) CDF from the likelihoods, and compare it with the uniform samples to pick the groups. A simple optimization: instead of normalizing the CDF to one by dividing by the final sum, multiply the uniform draws by that sum. Multiplication is much faster than division.

All these operations can of course easily be vectorized. I do that explicitly in the code above, in a manner similar to what the @vectorize macro does.
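For reference, a scalar version of that logic might look like this (an illustrative sketch, not the vectorized code above):

using Random

# Walk the unnormalized cumulative sum, scaling the uniform draw by the total
# instead of dividing every weight by it.
function sample_category(rng, weights::AbstractVector{T}) where {T}
    u = rand(rng, T) * sum(weights)
    acc = zero(T)
    @inbounds for g ∈ eachindex(weights)
        acc += weights[g]
        u < acc && return g
    end
    lastindex(weights)  # guard against floating-point round-off
end

sample_category(Random.GLOBAL_RNG, [0.5f0, 1.0f0, 0.25f0])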

Okay, let’s try running this and see how we do, loading the same fake dataset we used in the earlier R and basic Julia demos:

In [12]:
using DelimitedFiles, CovarianceRealism, BenchmarkTools, StaticArrays, Random
BigPropPoints = readdlm(joinpath("/home/chriselrod/Documents/progwork/R", "BigPropPoints.csv"), ',', Float64)
BPP32 = Float32.(BigPropPoints')

num_groups = 6
N = size(BPP32, 1); mcmc_iter = 10^4

workingdata = CovarianceRealism.WorkingData(N, Float32, Val(num_groups))
mcmcres = CovarianceRealism.MCMCResult(num_groups, mcmc_iter, Float32)
Y = Matrix{Float32}(undef, N, 3)
rank1covs = Vector{InverseWishart{Float32}}(undef, N);

CovarianceRealism.process_big_prop_points!(Y, BPP32)
CovarianceRealism.generate_rank1covariances!(rank1covs, Y)

@generated function base_π(::Val{N}, ::Type{T}) where {N,T}
    SVector(ntuple(Val(N)) do n
        T(1.5^(N-n))
    end) |> x -> x / sum(x)
end
baseπ =  base_π(Val(num_groups), Float32)

invwisharts, individual_probs, groups =
    workingdata.inverse_wisharts, workingdata.individual_probs, workingdata.groups;
probs, revcholwisharts, cholinvwisharts =
                mcmcres.Probs, mcmcres.RevCholWisharts, mcmcres.CholInvWisharts;


scalar_rng = CovarianceRealism.GLOBAL_PCG.scalar
vector_rng = CovarianceRealism.GLOBAL_PCG.vector

rcw = @views revcholwisharts[:,1];
ciw = @views cholinvwisharts[:,1];
pcj = @views probs[:,1];

@benchmark rand!($vector_rng, $groups, $baseπ)
Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     278.048 ns (0.00% GC)
  median time:      278.707 ns (0.00% GC)
  mean time:        279.142 ns (0.00% GC)
  maximum time:     332.969 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     294

Now that we have picked random groups from the initial prior, we can benchmark the components of our Gibbs sampler:

In [13]:
@benchmark CovarianceRealism.calc_Wisharts!($rcw, $ciw, $invwisharts, $groups, $rank1covs)
Out[13]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     886.292 ns (0.00% GC)
  median time:      889.500 ns (0.00% GC)
  mean time:        890.989 ns (0.00% GC)
  maximum time:     1.246 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     48
In [14]:
α = CovarianceRealism.extract_α(invwisharts, Val(num_groups));
@benchmark CovarianceRealism.randdirichlet!($scalar_rng, $pcj, $α)
Out[14]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     112.133 ns (0.00% GC)
  median time:      113.907 ns (0.00% GC)
  mean time:        114.166 ns (0.00% GC)
  maximum time:     132.020 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     927
In [15]:
ν = CovarianceRealism.extract_ν(invwisharts, Val(num_groups));
@benchmark CovarianceRealism.update_individual_probs!($individual_probs, $pcj, $rcw, $ν, $Y)
Out[15]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.876 μs (0.00% GC)
  median time:      3.893 μs (0.00% GC)
  mean time:        3.900 μs (0.00% GC)
  maximum time:     7.052 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8
In [16]:
length(individual_probs), size(individual_probs), 1024 * 6
Out[16]:
(6144, (1024, 6), 6144)
In [17]:
@benchmark CovarianceRealism.rand!($vector_rng, $groups, $individual_probs)
Out[17]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     368.348 ns (0.00% GC)
  median time:      369.029 ns (0.00% GC)
  mean time:        369.774 ns (0.00% GC)
  maximum time:     450.546 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     207

Subjectively, those times look fast. So, let’s run the entire sampler on a single core for the same number of iterations and warmup samples we ran JAGS with, and time it.

In [18]:
@benchmark CovarianceRealism.singlechain_sample!($mcmcres, $Y, $rank1covs, $workingdata, $BPP32, $baseπ, $mcmc_iter, 4000)
Out[18]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     79.896 ms (0.00% GC)
  median time:      81.036 ms (0.00% GC)
  mean time:        80.951 ms (0.00% GC)
  maximum time:     81.125 ms (0.00% GC)
  --------------
  samples:          62
  evals/sample:     1

That is 80 ms. JAGS took 138 seconds.

In [19]:
# seconds * ms/second / ms
138 * 1000 / 80
Out[19]:
1725.0

We are now roughly 1700 times faster than JAGS. Sweet!

In [20]:
# seconds * ms/second / ms
2.8 * 1000 / 80
Out[20]:
35.0

and about 35 times faster than our simple, unoptimized Julia code.

Additionally, I wrote a C++ version of this Gibbs model to serve as a basis for comparison (the master branch requires avx512, while the `xsimd` branch only requires avx). I wanted to optimize it for the same platform to get a fair comparison, but I did not want to spend the time to make it appropriately generic (able to run on most computers) as I did for the optimized Julia code.

The only C++ feature I used that doesn’t have an equivalent in C is templates. I used templates to make it easy to compile multiple separate versions of the sampler for different numbers of mixture components, although at the moment I’ve only compiled a version with 6 groups, to compare with the JAGS and Julia code.

I support both the Clang and GNU C++ compilers. Calling C from Julia is straightforward:

In [21]:
const CLANGBUILDDIR = "/home/chriselrod/Documents/progwork/Cxx/ClangCovarBuildDir";
const libgibbcovarrealismclang = joinpath(CLANGBUILDDIR,"libgibbcovarrealism.so");

function sample_6clang!(probs, revcholwisharts, cholinvwisharts, X, BPP, base_p, N, warmup, iter)
    ccall((:sample_6groups, libgibbcovarrealismclang), Cvoid,
        (Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Int,Int,Int),
        probs, revcholwisharts, cholinvwisharts, X, BPP, base_p, N, warmup, iter
    )
end
probs = mcmcres.Probs;
revcholwisharts = mcmcres.RevCholWisharts;
cholinvwisharts = mcmcres.CholInvWisharts;
base_p = Array(baseπ);

sample_6clang!(probs, revcholwisharts, cholinvwisharts, Y, BPP32, base_p, N, 4000, mcmc_iter)
In [22]:
@benchmark sample_6clang!($probs, $revcholwisharts, $cholinvwisharts, $Y, $BPP32, $base_p, $N, 4000, $mcmc_iter)
Out[22]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     66.668 ms (0.00% GC)
  median time:      66.858 ms (0.00% GC)
  mean time:        66.865 ms (0.00% GC)
  maximum time:     66.972 ms (0.00% GC)
  --------------
  samples:          75
  evals/sample:     1
In [23]:
const GNUBUILDDIR = "/home/chriselrod/Documents/progwork/Cxx/gccCovarBuildDir"
const libgibbcovarrealismgnu = joinpath(GNUBUILDDIR,"libgibbcovarrealism.so");
function sample_6gnu!(probs, revcholwisharts, cholinvwisharts, X, BPP, base_p, N, warmup, iter)
    ccall((:sample_6groups, libgibbcovarrealismgnu), Cvoid,
        (Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Ptr{Float32},Int,Int,Int),
        probs, revcholwisharts, cholinvwisharts, X, BPP, base_p, N, warmup, iter
    )
end

@benchmark sample_6gnu!($probs, $revcholwisharts, $cholinvwisharts, $Y, $BPP32, $base_p, $N, 4000, $mcmc_iter)
Out[23]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.709 ms (0.00% GC)
  median time:      76.728 ms (0.00% GC)
  mean time:        76.733 ms (0.00% GC)
  maximum time:     76.795 ms (0.00% GC)
  --------------
  samples:          66
  evals/sample:     1
In [24]:
138 * 1000 / 66.9
Out[24]:
2062.7802690582957
In [25]:
138 * 1000 / 76.7
Out[25]:
1799.2177314211212

Clang was around 2000 times faster (I realized later that the best JAGS run, which I quoted in the earlier post, was actually 131 seconds, not 138)! That is the difference between a simulation running overnight versus taking two years!

Or the time it takes to take a sip of coffee, versus 3 hours.