I am demonstrating a few examples of MCMC models using a Julia MCMC library I have been working on.

This library, ProbabilityModels.jl, also depends on a host of unregistered libraries I’ve created. Being unregistered means that to add them, you’ll need the git url rather than just the package name (like `devtools::install_github`

versus `install.packages`

in `R`

). These libraries include:

- VectorizationBase.jl: a basic package providing some basic functions to help vectorize code. It also queries the computer’s CPU upon installation, to provide helpful information to other libraries so that code can be compiled for the particular host CPU (in particular, SIMD vector width.
- SIMDPirates.jl: provides many SIMD intrinsics, used for generating or explicitly writing low level code. Started as a fork of SIMD.jl, but I extended it to support a base Julia type. If I had done so by overloading base functions (I did not), this would have been “type piracy”. But it does provide a macro which does transform expressions to commit piracy, and because it started as a fork of another library, “pirate” was an appropriate name.
- SLEEFPirates.jl: provides SIMD versions of elemental functions, like
`exp`

and`log`

. Started as a fork of SLEEF.jl which is itself a port of the C library SLEEF, and it supports`SIMDPirates.jl`

, so calling it “pirate” again seemed appropriate. - VectorizedRNG.jl: provides vectorized random number generators, using permuted congrutional generators.
- LoopVectorization.jl: sometimes Julia/LLVM’s autovectorizer fails to vectorize “for loops”. This library provides a macro that assumes vectorization is legal (eg, arrays do not alias), and unrolls the for loops, inserting SIMD intrinsics as appropriate. Even when the auto-vectorizer works, loops unrolled by this library can be faster on some architectures (in particular, avx512) because
`LoopVectorization.jl`

catches the remainder of a loop that isn’t a multiple of SIMD vector width with masked vector instructions, instead of singleton iterations. - PaddedMatrices.jl: provides an array type with “padding”. This makes it much faster in some cases.
- ScatteredArrays.jl: provides arrays with a struct of arrays data layout, including a chunked array which is like an array of struct of arrays to improve cash locality. The chunk sizes match the SIMD vector width.
- StructuredMatrices.jl: provides array types with specific sparse structures. For example, triangular or autoregressive matrices.
- DistributionParameters.jl: provides a few parameter types for probability distributions, such as bounded and unbounded scalars, vectors, and matrices, as well as the Cholesky factor of an LKJ correlation matrix. I have not implemented much yet.
- ProbabilityDistributions.jl: provides a few probability distributions. A few methods for normal, gamma, and beta distributions, but I have not implemented much yet.
- ProbabilityModels.jl: provides a domain specific language for specifying statistical models, also providing the derivatives required by Hamiltonian Monte Carlo. This library needs a lot more work, and is still at the proof of concept stage.

To install them all, this should work:

```
using Pkg
pkgs = [
PackageSpec(url="https://github.com/chriselrod/VectorizationBase.jl", rev="master"),
PackageSpec(url="https://github.com/chriselrod/SIMDPirates.jl", rev="master"),
PackageSpec(url="https://github.com/chriselrod/SLEEFPirates.jl", rev="master"),
PackageSpec(url="https://github.com/chriselrod/VectorizedRNG.jl", rev="master"),
PackageSpec(url="https://github.com/chriselrod/LoopVectorization.jl", rev="master"),
PackageSpec(url="https://github.com/chriselrod/PaddedMatrices.jl", rev="master"),
PackageSpec(url="https://github.com/chriselrod/ScatteredArrays.jl", rev="master"),
PackageSpec(url="https://github.com/chriselrod/StructuredMatrices.jl", rev="master"),
PackageSpec(url="https://github.com/chriselrod/DistributionParameters.jl", rev="master"),
PackageSpec(url="https://github.com/chriselrod/ProbabilityDistributions.jl", rev="master"),
PackageSpec(url="https://github.com/chriselrod/ProbabilityModels.jl", rev="master")
]
Pkg.add(pkgs)
Pkg.build("VectorizationBase")
```

You don’t need to get caught up in these. I bring them up here to say that this has been a long term project, that I’ve been steadily building up the tools I need. And that my predictions and performance goals are based on these libraries I am using, many of which are quite fast.

For example, comparing PaddedMatrices with StaticArrays.jl, one of the most popular Julia libraries for performance:

```
using StaticArrays, PaddedMatrices, BenchmarkTools
# Create two StaticArrays
SA1 = @SMatrix randn(7,10);
SA2 = @SMatrix randn(10, 9);
# benchmark matrix multiplication
@benchmark $SA1 * $SA2
```

```
# Create two PaddedMatrices
PA1 = @Constant randn(7,10);
PA2 = @Constant randn(10, 9);
# benchmark matrix multiplication
@benchmark $PA1 * $PA2
```

That is about 6 times faster!

Much of the difference is thanks to padding. For small-matrix multiplication to be vectorized, the lefthand matrix (SA1/PA1) needs to have a number of rows that’s a multiple of SIMD vector width. PaddedMatrices adds invisible padding to achieve this.

But that isn’t the only trick; it also does a better job ordering calculations to avoid “register spills”, so that it is faster for larger matrices, even when StaticArrays can vectorize them:

```
# Create two StaticArrays
SA3 = @SMatrix randn(16,42);
SA4 = @SMatrix randn(42,14);
# benchmark matrix multiplication
@benchmark $SA3 * $SA4
```

```
# Create two PaddedMatrices
PA3 = @Constant randn(16,42);
PA4 = @Constant randn(42,14);
# benchmark matrix multiplication
@benchmark $PA3 * $PA4
```

Generating random matrices is also faster, thanks to `VectorizedRNG.jl`

:

```
@benchmark @SMatrix randn(7,10)
```

```
@benchmark @Constant randn(7,10)
```

The point I am trying to make here is that it is not because my code is written in Julia that I expected it to be fast.

For one thing, it is easy to write slow code in most languages, and this is especially true for Julia compared to a language like `C++`

or `Fortran`

.

However, if you write similar code doing similar things in all three languages, performance will be similar. This is especially true when that `C++`

code is compiled with `Clang++`

(or Fortran with `Flang`

), because these compilers all use LLVM as the backend-optimizer. From that point of view, the actual language you’re using is largely just a different front-end to LLVM.

But those front ends change what LLVM is given to work with. Just like `C++`

templates can be used for writing fast generic code by compiling versions of algorithms or functions specialized for the specific characteristics of a problem, Julia’s advanced metaprogramming lets you do the same.

Without further delay, lets dive into a basic example to get started:

```
using ProbabilityModels, DistributionParameters, VectorizationBase, LoopVectorization
using LogDensityProblems, DynamicHMC, LinearAlgebra
@model LogisticRegressionModel begin
β₀ ~ Normal(μ₀, σ₀)
β₁ ~ Normal(μ₁, σ₁)
y ~ Bernoulli_logit(β₀ + X * β₁)
end
```

Here we define a logistic regression model. A brief rundown:

`@model`

indicates that you’re defining a model.

It requires two arguments:

- The name of the model. In this case,
`LogisticRegressionModel`

. - The model itself, in between a
`begin`

and`end`

block. It uses a familiar “distributed” syntax, and most unicode characters are legal.

With this, it infers the unknowns of the models (variables that don’t appear on the left hand side of an “=”).

To fit the model, lets generate some fake data:

```
N = 800; N_β = 4;
X = randn(N, N_β); # N x N_β matrix of random normal samples
β₁ = [-1.6, -1.75, -0.26, 0.65];
β₀ = -0.05
Xβ₁ = X * β₁;
p = rand(N); # N random uniform numbers between 0 and 1
y = @. p < 1 / (1 + exp( - Xβ₁ - β₀)); # inverse logit; generate random observations
sum(y) # how many ys are true?
```

Here, we have a sample size of 800, and with 4 covariates and an intercept.

Now, before fitting the model, we need to fill in all the unknowns, by either providing data (making them known), or explicitly declaring them as unknown:

```
ℓ_logistic = LogisticRegressionModel(
μ₀ = 0.0, σ₀ = 10.0, μ₁ = 0.0, σ₁ = 5.0,
β₀ = RealFloat, β₁ = RealVector{4},
y = y, X = X
);
```

So our only unknown parameters now are “β₀”, the intercept, and “β₁”, a vector of 4 coefficients. To draw 40000 samples:

```
@time mcmc_chain, tuned_sampler = NUTS_init_tune_mcmc_default(ℓ_logistic, 40000);
```

I chose such a large number for later comparison in runtime with Stan. I want to (a) avoid repeated Stan runs, and (b) amortize the overhead of calling Stan.

Note that the time here also included compiling the model. The model only compiles the first time it is run. Benchmarking repeated runs without the precompilation:

```
@benchmark NUTS_init_tune_mcmc_default($ℓ_logistic, 40000, report = DynamicHMC.ReportSilent())
```

One second isn’t a lot of time to wait, especially for this many iterations.

The (number% GC) refers to the percentage of the runtime taken by the garbage collector (GC).

Even when Julia’s garbage collector does nothing at all, it is very slow:

```
@benchmark GC.gc()
```

No memory was allocated between calls, meaning that roughly 70 milliseconds wasn’t spent freeing any memory — just checking if it was allowed to free anything.

`C++`

does not have a garbage collector, so this is a performance advantage inherent to `C++`

: by forcing you to manually manage memory, an otherwise equivalent `C++`

implementation would be $>4\%$ faster.

Of course, you can manually manage memory in Julia too. It’s just more difficult in Julia than it is in `C++`

.

Digression asside, let’s turn back to MCMC. We can get a summary of the behavior of the No U-Turn Sampler:

```
NUTS_statistics(mcmc_chain)
```

We had an acceptance rate of about $90\%$, a maximum tree depth of 7, with most iterations hitting a depth of 2 or 3.

Additionally, we can look at (and possibly reuse) the tuned sampler:

```
tuned_sampler
```

The maximum depth listed there is the maximum allowed, which defaults to 10 (as it does in Stan). We see that the stepsize was 0.271.

Means of the parameters (ordered “β₀”, and then “β₁” — you can tell based on the order they appear in “Unknowns: σ₁, β₀, y, β₁, X, μ₀, σ₀, μ₁.”):

```
using Statistics
sample_matrix = get_position_matrix(mcmc_chain)
mean(sample_matrix, dims = 1)
```

and the standard devation:

```
std(sample_matrix, dims = 1)
```

We can additionally look at the sample quantiles of $\beta_0$:

```
println("The true value of β₀: $β₀.")
println("Posterior quantiles:")
major_quantiles = [0.05 0.25 0.5 0.75 0.95]
quantile!(sample_matrix[:,1], major_quantiles)
```

These form a 90% interval, 50% interval, and the median.

and the quantiles for $\beta_1$:

```
for i ∈ 1:4
display("Major Quantiles for element #$i of β₁ with true value $(β₁[i]):")
display(vcat(major_quantiles, quantile!(sample_matrix[:,i+1], major_quantiles)))
end
```

Additionally, we can look at the autocorrelations:

```
using StatsBase
autocor(sample_matrix, 0:10)
```

A few of these first autocorrelation are negative, which is nice. Using these to estimate the effective sample sizes can thus result in effective sample sizes greater than the actual sample size:

```
40000 ./ (1 .+ 2 .* sum(autocor(sample_matrix, 1:100), dims = 1))
```

Alternatively, MCMCDiagnostics provides a convenience function using a different (and probably more correct) algorithm, although it always returns the minimum between the calculated effective sample size and actual number of samples:

```
using MCMCDiagnostics
[effective_sample_size(sample_matrix[:,i]) for i ∈ 1:size(sample_matrix,2)]'
```

`MCMCDiagnostics`

of course also provides the potential scale reduction factor (ie, the infamous $\hat{r}$). This requires multiple chains.

```
chains = [
get_position_matrix(
NUTS_init_tune_mcmc_default(
ℓ_logistic, 4000, report = DynamicHMC.ReportSilent()
)[1])
for i ∈ 1:10
]
for i ∈ 1:5
chainᵢ = [chain[:,i] for chain ∈ chains]
@show i, potential_scale_reduction(chainᵢ...)
end
```

Our $\hat{r}$ are all approximately 1.

Probably not a good idea to run 10 chains and use the results only to calculate $\hat{r}$, and then discard them.

Which is exactly what I did above.

Now, lets compare with Stan here. CmdStan 2.19 has been released, which added vectorized versions of the `bernoulli_logit`

distribution. It was discussed on Andrew Gelman’s blog back in 2017.

As of writing this, RStan is still on version 2.18, so I’ll using CmdStan.jl and try both versions:

```
using CmdStan
ProjDir = "/home/chriselrod/Documents/progwork/julia/StanProjDir"
bernoulli_logit = """
data {
int<lower=0> N;
int y[N];
matrix[N,4] X;
real mu0;
real mu1;
real sigma0;
real sigma1;
}
parameters {
real beta0;
vector[4] beta1;
}
model {
beta0 ~ normal(mu0, sigma0);
beta1 ~ normal(mu1, sigma1);
// y ~ bernoulli_logit_glm(X, beta0, beta1);
y ~ bernoulli_logit(beta0 + X * beta1);
}"""
bernoulli_logit_glm = """
data {
int<lower=0> N;
int y[N];
matrix[N,4] X;
real mu0;
real mu1;
real sigma0;
real sigma1;
}
parameters {
real beta0;
vector[4] beta1;
}
model {
beta0 ~ normal(mu0, sigma0);
beta1 ~ normal(mu1, sigma1);
y ~ bernoulli_logit_glm(X, beta0, beta1);
// y ~ bernoulli_logit(beta0 + X * beta1);
}"""
logistic_data_dict = Dict(
"N" => N, "y" => convert(Vector{Int32},y), "X" => X,
"mu0" => 0, "mu1" => 0, "sigma0" => 10, "sigma1" => 5
);
```

Now, let’s compile the models. I’m using only a single chain, because of models this fast the overhead of launching chains in parallel is high enough to make timing less accurate:

```
stanmodel_logistic = Stanmodel(
name = "logistic", Sample(num_samples=40000,num_warmup=900),
model = bernoulli_logit, nchains = 1
)
stanmodel_logistic_glm = Stanmodel(
name = "logistic_glm", Sample(num_samples=40000,num_warmup=900),
model = bernoulli_logit_glm, nchains = 1
);
```

```
@time rc, chns, cnames = stan(stanmodel_logistic, logistic_data_dict, ProjDir);
```

```
@time rc_glm, chns_glm, cnames_glm = stan(stanmodel_logistic_glm, logistic_data_dict, ProjDir);
```

These took about 9.82 and 7.36 seconds respectively, versus the roughly 1.1 seconds for the Julia version $-$ despite the fact that Stan also had a fast “vectorized” `bernoulli_logit_glm`

.

Interestingly, Stan used larger stepsizes (0.59 and 0.71 vs 0.242). Trees were about the same deep on average.