ProbabilityModels Examples

I am demonstrating a few examples of MCMC models using a Julia MCMC library I have been working on.

This library, ProbabilityModels.jl, also depends on a host of unregistered libraries I’ve created. Being unregistered means that to add them, you’ll need the git URL rather than just the package name (like devtools::install_github versus install.packages in R). These libraries include:

  • VectorizationBase.jl: provides basic functions to help vectorize code. It also queries the host CPU upon installation, providing information (in particular, the SIMD vector width) to the other libraries so that code can be compiled for that particular CPU.
  • SIMDPirates.jl: provides many SIMD intrinsics, used for generating or explicitly writing low-level code. It started as a fork of SIMD.jl, which I extended to support a base Julia type. Had I done so by overloading base functions (I did not), that would have been “type piracy”. It does, however, provide a macro that transforms expressions to commit piracy, and because it began as a fork of another library, “pirate” seemed an appropriate name.
  • SLEEFPirates.jl: provides SIMD versions of elemental functions, like exp and log. Started as a fork of SLEEF.jl which is itself a port of the C library SLEEF, and it supports SIMDPirates.jl, so calling it “pirate” again seemed appropriate.
  • VectorizedRNG.jl: provides vectorized random number generators, using permuted congruential generators.
  • LoopVectorization.jl: sometimes Julia/LLVM’s autovectorizer fails to vectorize “for loops”. This library provides a macro that assumes vectorization is legal (e.g., arrays do not alias) and unrolls the for loops, inserting SIMD intrinsics as appropriate. Even when the autovectorizer works, loops unrolled by this library can be faster on some architectures (in particular, AVX-512), because LoopVectorization.jl handles the remainder of a loop that isn’t a multiple of the SIMD vector width with masked vector instructions instead of singleton iterations (see the schematic after this list).
  • PaddedMatrices.jl: provides an array type with “padding”: the number of rows is padded up to a multiple of the SIMD vector width, which makes operations like small-matrix multiplication much faster in some cases.
  • ScatteredArrays.jl: provides arrays with a struct-of-arrays data layout, including a chunked array, which is like an array of structs of arrays, to improve cache locality. The chunk sizes match the SIMD vector width.
  • StructuredMatrices.jl: provides array types with specific sparse structures. For example, triangular or autoregressive matrices.
  • DistributionParameters.jl: provides a few parameter types for probability distributions, such as bounded and unbounded scalars, vectors, and matrices, as well as the Cholesky factor of an LKJ correlation matrix. I have not implemented much yet.
  • ProbabilityDistributions.jl: provides a few probability distributions. A few methods for normal, gamma, and beta distributions, but I have not implemented much yet.
  • ProbabilityModels.jl: provides a domain specific language for specifying statistical models, also providing the derivatives required by Hamiltonian Monte Carlo. This library needs a lot more work, and is still at the proof of concept stage.
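
As a schematic of the masked-remainder idea mentioned for LoopVectorization.jl above, here is the strategy written with plain Julia slices rather than the library’s macro or SIMD intrinsics; it only illustrates the idea, not the generated code:

# Sum a vector in chunks of width W. The final partial chunk is handled in a single
# pass over a shorter slice (standing in for a masked SIMD operation), rather than
# one scalar iteration per leftover element.
function chunked_sum(x::Vector{Float64}, W::Int = 8)
    s = 0.0
    N = length(x)
    i = 1
    while i + W - 1 <= N                      # full-width chunks
        s += sum(@view x[i:i+W-1])
        i += W
    end
    i <= N && (s += sum(@view x[i:N]))        # remainder: one "masked" pass
    return s
end

chunked_sum(randn(19))  # 19 elements = 2 full chunks of 8, plus a remainder of 3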

To install them all, this should work:

using Pkg
pkgs = [
    PackageSpec(url="https://github.com/chriselrod/VectorizationBase.jl", rev="master"),
    PackageSpec(url="https://github.com/chriselrod/SIMDPirates.jl", rev="master"),
    PackageSpec(url="https://github.com/chriselrod/SLEEFPirates.jl", rev="master"),
    PackageSpec(url="https://github.com/chriselrod/VectorizedRNG.jl", rev="master"),
    PackageSpec(url="https://github.com/chriselrod/LoopVectorization.jl", rev="master"),
    PackageSpec(url="https://github.com/chriselrod/PaddedMatrices.jl", rev="master"),
    PackageSpec(url="https://github.com/chriselrod/ScatteredArrays.jl", rev="master"),
    PackageSpec(url="https://github.com/chriselrod/StructuredMatrices.jl", rev="master"),
    PackageSpec(url="https://github.com/chriselrod/DistributionParameters.jl", rev="master"),
    PackageSpec(url="https://github.com/chriselrod/ProbabilityDistributions.jl", rev="master"),
    PackageSpec(url="https://github.com/chriselrod/ProbabilityModels.jl", rev="master")
]
Pkg.add(pkgs)
Pkg.build("VectorizationBase")

You don’t need to get caught up in these libraries. I bring them up here to make the point that this has been a long-term project, and that I’ve been steadily building up the tools I need. My predictions and performance goals are based on these libraries, many of which are quite fast.

For example, comparing PaddedMatrices with StaticArrays.jl, one of the most popular Julia libraries for performance:

In [1]:
using StaticArrays, PaddedMatrices, BenchmarkTools

# Create two StaticArrays
SA1 = @SMatrix randn(7,10);
SA2 = @SMatrix randn(10, 9);
# benchmark matrix multiplication
@benchmark $SA1 * $SA2
Out[1]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     120.572 ns (0.00% GC)
  median time:      121.215 ns (0.00% GC)
  mean time:        124.964 ns (0.00% GC)
  maximum time:     280.343 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     883
In [2]:
# Create two PaddedMatrices
PA1 = @Constant randn(7,10);
PA2 = @Constant randn(10, 9);
# benchmark matrix multiplication
@benchmark $PA1 * $PA2
Out[2]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.616 ns (0.00% GC)
  median time:      21.771 ns (0.00% GC)
  mean time:        21.851 ns (0.00% GC)
  maximum time:     48.121 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

That is about 6 times faster!

Much of the difference is thanks to padding. For small-matrix multiplication to be vectorized, the left-hand matrix (SA1/PA1) needs a number of rows that is a multiple of the SIMD vector width. PaddedMatrices adds invisible padding to achieve this.
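
As a rough plain-Julia illustration of the idea (PaddedMatrices does this transparently with its own storage type), a 7-row matrix can be padded with an extra zero row so that every column spans a whole number of SIMD registers:

A = randn(7, 10)        # 7 rows: not a multiple of the SIMD vector width
Apad = zeros(8, 10)     # pad up to 8 rows with zeros
Apad[1:7, :] .= A       # the zero row only adds an extra, ignorable row to any product Apad * B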

But that isn’t the only trick; it also does a better job ordering calculations to avoid “register spills”, so that it is faster for larger matrices, even when StaticArrays can vectorize them:

In [3]:
# Create two StaticArrays
SA3 = @SMatrix randn(16,42);
SA4 = @SMatrix randn(42,14);
# benchmark matrix multiplication
@benchmark $SA3 * $SA4
Out[3]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     291.620 ns (0.00% GC)
  median time:      291.788 ns (0.00% GC)
  mean time:        292.323 ns (0.00% GC)
  maximum time:     464.985 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     274
In [4]:
# Create two PaddedMatrices
PA3 = @Constant randn(16,42);
PA4 = @Constant randn(42,14);
# benchmark matrix multiplication
@benchmark $PA3 * $PA4
Out[4]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     147.671 ns (0.00% GC)
  median time:      160.368 ns (0.00% GC)
  mean time:        158.171 ns (0.00% GC)
  maximum time:     361.973 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     788

Generating random matrices is also faster, thanks to VectorizedRNG.jl:

In [5]:
@benchmark @SMatrix randn(7,10)
Out[5]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     288.108 ns (0.00% GC)
  median time:      296.077 ns (0.00% GC)
  mean time:        298.506 ns (0.00% GC)
  maximum time:     832.298 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     305
In [6]:
@benchmark @Constant randn(7,10)
Out[6]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     158.137 ns (0.00% GC)
  median time:      158.560 ns (0.00% GC)
  mean time:        159.144 ns (0.00% GC)
  maximum time:     351.770 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     788

The point I am trying to make here is that I do not expect my code to be fast simply because it is written in Julia.
For one thing, it is easy to write slow code in most languages, and that is especially true of Julia compared to a language like C++ or Fortran.

However, if you write similar code doing similar things in all three languages, performance will be similar. This is especially true when that C++ code is compiled with Clang++ (or Fortran with Flang), because these compilers all use LLVM as the backend-optimizer. From that point of view, the actual language you’re using is largely just a different front-end to LLVM.

But those front ends change what LLVM is given to work with. Just like C++ templates can be used for writing fast generic code by compiling versions of algorithms or functions specialized for the specific characteristics of a problem, Julia’s advanced metaprogramming lets you do the same.
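
As a toy illustration of that point (not code from the libraries above), a size can be lifted into the type domain, so the compiler emits a fully specialized, unrolled method for each size, much as a C++ template is instantiated per parameter:

# Horner evaluation of a polynomial whose coefficients are stored in an NTuple.
# The tuple length M is part of the type, so a separate method body is compiled
# (with the loop fully unrolled) for every M, like a C++ template instantiation.
horner(coeffs::NTuple{M,Float64}, x::Float64) where {M} =
    foldr((c, acc) -> muladd(x, acc, c), coeffs)

horner((1.0, 2.0, 3.0), 0.5)  # 1 + 2*0.5 + 3*0.25 = 2.75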

Without further delay, let’s dive into a basic example to get started:

In [7]:
using ProbabilityModels, DistributionParameters, VectorizationBase, LoopVectorization
using LogDensityProblems, DynamicHMC, LinearAlgebra

@model LogisticRegressionModel begin
    β₀ ~ Normal(μ₀, σ₀)
    β₁ ~ Normal(μ₁, σ₁)
    y ~ Bernoulli_logit(β₀ + X * β₁)
end
    Defined model: LogisticRegressionModel.
    Unknowns: σ₁, β₀, y, β₁, X, μ₀, σ₀, μ₁.

Here we define a logistic regression model. A brief rundown:

@model indicates that you’re defining a model.

It requires two arguments:

  1. The name of the model. In this case, LogisticRegressionModel.
  2. The model itself, between a begin and end block. It uses the familiar “~” (“is distributed as”) syntax, and most Unicode characters are legal.

With this, it infers the unknowns of the model (variables that do not appear on the left-hand side of an “=”).
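
For this particular model, the (unnormalized) log density that Hamiltonian Monte Carlo works with is, writing $\eta = \beta_0 + X\beta_1$ and treating the Normal’s second argument as a scale:

$$\log p(\beta_0, \beta_1 \mid y, X) = \log \mathcal{N}(\beta_0 \mid \mu_0, \sigma_0^2) + \sum_{j} \log \mathcal{N}(\beta_{1j} \mid \mu_1, \sigma_1^2) + \sum_{i=1}^{N} \left[ y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right] + \text{const}.$$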

To fit the model, let’s generate some fake data:

In [8]:
N = 800; N_β = 4;
X = randn(N, N_β); # N x N_β matrix of random normal samples
β₁ = [-1.6, -1.75, -0.26, 0.65];
β₀ = -0.05
Xβ₁ = X * β₁;
p = rand(N); # N random uniform numbers between 0 and 1
y = @. p < 1 / (1 + exp( - Xβ₁ - β₀)); # inverse logit; generate random observations
sum(y) # how many ys are true?
Out[8]:
385

Here we have a sample size of 800, with 4 covariates and an intercept.
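
Written out, the data-generating process in the cell above is

$$y_i \sim \operatorname{Bernoulli}\left(\operatorname{logit}^{-1}\left(\beta_0 + x_i^\top \beta_1\right)\right), \qquad \beta_0 = -0.05, \quad \beta_1 = (-1.6, -1.75, -0.26, 0.65),$$

which matches the likelihood in the model definition, so the posterior should concentrate around these values.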

Now, before fitting the model, we need to fill in all the unknowns, by either providing data (making them known), or explicitly declaring them as unknown:

In [9]:
ℓ_logistic = LogisticRegressionModel(
    μ₀ = 0.0, σ₀ = 10.0, μ₁ = 0.0, σ₁ = 5.0,
    β₀ = RealFloat, β₁ = RealVector{4},
    y = y, X = X
);

So our only unknown parameters now are “β₀”, the intercept, and “β₁”, a vector of 4 coefficients. To draw 40000 samples:

In [10]:
@time mcmc_chain, tuned_sampler = NUTS_init_tune_mcmc_default(ℓ_logistic, 40000);
MCMC, adapting ϵ (75 steps)
3.2e-5 s/step ...done
MCMC, adapting ϵ (25 steps)
3.1e-5 s/step ...done
MCMC, adapting ϵ (50 steps)
0.00027 s/step ...done
MCMC, adapting ϵ (100 steps)
3.2e-5 s/step ...done
MCMC, adapting ϵ (200 steps)
5.3e-5 s/step ...done
MCMC, adapting ϵ (400 steps)
2.5e-5 s/step ...done
MCMC, adapting ϵ (50 steps)
3.9e-5 s/step ...done
MCMC (40000 steps)
2.5e-5 s/step ...done
  9.308906 seconds (24.65 M allocations: 1.978 GiB, 4.39% gc time)

I chose such a large number of samples for the later runtime comparison with Stan: I want to (a) avoid repeated Stan runs, and (b) amortize the overhead of calling Stan.

Note that the time here also includes compiling the model; the model is only compiled the first time it is run. Benchmarking repeated runs without the compilation overhead:

In [11]:
@benchmark NUTS_init_tune_mcmc_default($ℓ_logistic, 40000, report = DynamicHMC.ReportSilent())
Out[11]:
BenchmarkTools.Trial: 
  memory estimate:  990.90 MiB
  allocs estimate:  4944641
  --------------
  minimum time:     965.463 ms (3.72% GC)
  median time:      1.124 s (3.79% GC)
  mean time:        1.096 s (4.98% GC)
  maximum time:     1.208 s (3.88% GC)
  --------------
  samples:          5
  evals/sample:     1

One second isn’t a lot of time to wait, especially for this many iterations.
The (number% GC) refers to the percentage of the runtime taken by the garbage collector (GC).

Even when Julia’s garbage collector does nothing at all, it is very slow:

In [12]:
@benchmark GC.gc()
Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     64.589 ms (100.00% GC)
  median time:      66.410 ms (100.00% GC)
  mean time:        66.002 ms (100.00% GC)
  maximum time:     67.016 ms (100.00% GC)
  --------------
  samples:          76
  evals/sample:     1

No memory was allocated between calls, meaning those roughly 66 milliseconds weren’t spent freeing any memory, only checking whether there was anything to free.

C++ does not have a garbage collector, so this is a performance advantage inherent to C++: by forcing you to manage memory manually, it lets an otherwise equivalent C++ implementation be $>4\%$ faster here.

Of course, you can manually manage memory in Julia too. It’s just more difficult in Julia than it is in C++.
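
For example (a generic illustration, not specific to the libraries above), preallocating an output buffer and using in-place operations keeps a hot loop allocation-free, so the garbage collector has nothing to do:

using LinearAlgebra, BenchmarkTools

A = randn(100, 100); B = randn(100, 100);
C = similar(A);                 # allocate the output once, up front
@benchmark mul!($C, $A, $B)     # in-place multiply: 0 bytes allocated per call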

Digression aside, let’s turn back to MCMC. We can get a summary of the behavior of the No-U-Turn Sampler:

In [13]:
NUTS_statistics(mcmc_chain)
Out[13]:
Hamiltonian Monte Carlo sample of length 40000
  acceptance rate mean: 0.93, min/25%/median/75%/max: 0.31 0.89 0.96 0.99 1.0
  termination: AdjacentTurn => 13% DoubledTurn => 87%
  depth: 1 => 0% 2 => 31% 3 => 69%

We had a mean acceptance rate of about $93\%$, with most iterations reaching a tree depth of 2 or 3.
Additionally, we can look at (and possibly reuse) the tuned sampler:

In [14]:
tuned_sampler
Out[14]:
NUTS sampler in 5 dimensions
  stepsize (ϵ) ≈ 0.242
  maximum depth = 10
  Gaussian kinetic energy, √diag(M⁻¹): [0.26132952478019855, 0.2739029924878523, 0.27318106557562166, 0.25978575168086654, 0.2560530515594817]

The maximum depth listed there is the maximum allowed, which defaults to 10 (as it does in Stan). We see that the stepsize was about 0.242.

Means of the parameters (ordered “β₀”, and then “β₁” — you can tell based on the order they appear in “Unknowns: σ₁, β₀, y, β₁, X, μ₀, σ₀, μ₁.”):

In [15]:
using Statistics
sample_matrix = get_position_matrix(mcmc_chain)
mean(sample_matrix, dims = 1)
Out[15]:
1×5 Array{Float64,2}:
 0.0304069  -1.56048  -1.61189  -0.341209  0.607786

and the standard deviation:

In [16]:
std(sample_matrix, dims = 1)
Out[16]:
1×5 Array{Float64,2}:
 0.0942155  0.128844  0.131392  0.100252  0.0934212

We can additionally look at the sample quantiles of $\beta_0$:

In [17]:
println("The true value of β₀: $β₀.")
println("Posterior quantiles:")
major_quantiles = [0.05 0.25 0.5 0.75 0.95]
quantile!(sample_matrix[:,1], major_quantiles)
The true value of β₀: -0.05.
Posterior quantiles:
Out[17]:
1×5 Array{Float64,2}:
 -0.123821  -0.0332191  0.029696  0.0940319  0.186714

These form a 90% interval, 50% interval, and the median.

and the quantiles for $\beta_1$:

In [18]:
for i ∈ 1:4
    display("Major Quantiles for element #$i of β₁ with true value $(β₁[i]):")
    display(vcat(major_quantiles, quantile!(sample_matrix[:,i+1], major_quantiles)))
end
"Major Quantiles for element #1 of β₁ with true value -1.6:"
2×5 Array{Float64,2}:
  0.05    0.25      0.5       0.75      0.95  
 -1.776  -1.64686  -1.55847  -1.47255  -1.3515
"Major Quantiles for element #2 of β₁ with true value -1.75:"
2×5 Array{Float64,2}:
  0.05      0.25      0.5       0.75      0.95   
 -1.83297  -1.69838  -1.60957  -1.52234  -1.40073
"Major Quantiles for element #3 of β₁ with true value -0.26:"
2×5 Array{Float64,2}:
  0.05     0.25       0.5        0.75       0.95    
 -0.5065  -0.408963  -0.340627  -0.272998  -0.177104
"Major Quantiles for element #4 of β₁ with true value 0.65:"
2×5 Array{Float64,2}:
 0.05      0.25      0.5       0.75     0.95    
 0.455393  0.545229  0.606642  0.66969  0.762706

Additionally, we can look at the autocorrelations:

In [19]:
using StatsBase
autocor(sample_matrix, 0:10)
Out[19]:
11×5 Array{Float64,2}:
  1.0           1.0          1.0           1.0           1.0        
 -0.163296      0.0292223    0.0408467    -0.123452     -0.125508   
  0.0303602     0.0227095    0.0317597     0.0128804     0.0355236  
 -0.00746949    0.00168336   0.00713094   -0.0017543    -0.0025821  
  0.00513915    0.00382935  -0.00491197   -0.00456616   -0.00259444 
 -0.000349513   0.00313788   0.00135416    0.000744715   0.00111665 
  0.000828628   0.00394872   0.00668702   -0.0019146    -0.00162214 
  0.0014975    -0.00318541   2.79615e-5    0.0017537     0.000604525
  0.000292835  -0.00368561  -0.00429031    0.00270876    0.00193301 
  0.00768295   -0.00352947  -0.000313038  -0.00269641    0.00275268 
  0.00122586    0.00638672   0.00616584   -0.00855436   -0.00744474 

A few of the lag-one autocorrelations are negative, which is nice. Using these autocorrelations to estimate the effective sample sizes can therefore yield effective sample sizes greater than the actual number of samples:
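
The estimator used in the next cell is the standard one,

$$\widehat{\text{ESS}} = \frac{N}{1 + 2\sum_{k=1}^{K} \rho_k},$$

with $N = 40000$ samples and the autocorrelations $\rho_k$ summed over lags $k = 1, \dots, 100$; a negative sum of autocorrelations pushes the estimate above $N$.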

In [20]:
40000 ./ (1 .+ 2 .* sum(autocor(sample_matrix, 1:100), dims = 1))
Out[20]:
1×5 Array{Float64,2}:
 53037.0  36118.6  33355.6  52029.1  47416.2

Alternatively, MCMCDiagnostics provides a convenience function using a different (and probably more correct) algorithm, although it always returns the minimum of the calculated effective sample size and the actual number of samples:

In [21]:
using MCMCDiagnostics
[effective_sample_size(sample_matrix[:,i]) for i ∈ 1:size(sample_matrix,2)]'
Out[21]:
1×5 Adjoint{Float64,Array{Float64,1}}:
 40000.0  35645.1  34501.6  40000.0  40000.0

MCMCDiagnostics of course also provides the potential scale reduction factor (i.e., the infamous $\hat{r}$). This requires multiple chains.

In [22]:
chains = [
    get_position_matrix(
        NUTS_init_tune_mcmc_default(
            ℓ_logistic, 4000, report = DynamicHMC.ReportSilent()
        )[1])
    for i ∈ 1:10
]
for i ∈ 1:5
    chainᵢ = [chain[:,i] for chain ∈ chains]
    @show i, potential_scale_reduction(chainᵢ...)
end
(i, potential_scale_reduction(chainᵢ...)) = (1, 1.000046067877965)
(i, potential_scale_reduction(chainᵢ...)) = (2, 1.0001716334189055)
(i, potential_scale_reduction(chainᵢ...)) = (3, 1.0001287737100368)
(i, potential_scale_reduction(chainᵢ...)) = (4, 1.0000329709306899)
(i, potential_scale_reduction(chainᵢ...)) = (5, 1.00034120920148)

Our $\hat{r}$ are all approximately 1.

It is probably not a good idea to run 10 chains, use the results only to calculate $\hat{r}$, and then discard them.

Which is exactly what I did above.

Now, let’s compare with Stan. CmdStan 2.19 has been released, which added a vectorized GLM version of the bernoulli_logit distribution (bernoulli_logit_glm); it was discussed on Andrew Gelman’s blog back in 2017.
As of this writing, RStan is still on version 2.18, so I’ll use CmdStan.jl and try both versions:

In [23]:
using CmdStan

ProjDir = "/home/chriselrod/Documents/progwork/julia/StanProjDir"

bernoulli_logit = """
data {
  int<lower=0> N;
  int y[N];
  matrix[N,4] X;
  real mu0;
  real mu1;
  real sigma0;
  real sigma1;
}
parameters {
  real beta0;
  vector[4] beta1;
}
model {
  beta0 ~ normal(mu0, sigma0);
  beta1 ~ normal(mu1, sigma1);
  // y ~ bernoulli_logit_glm(X, beta0, beta1);
  y ~ bernoulli_logit(beta0 + X * beta1);
}"""

bernoulli_logit_glm = """
data {
  int<lower=0> N;
  int y[N];
  matrix[N,4] X;
  real mu0;
  real mu1;
  real sigma0;
  real sigma1;
}
parameters {
  real beta0;
  vector[4] beta1;
}
model {
  beta0 ~ normal(mu0, sigma0);
  beta1 ~ normal(mu1, sigma1);
  y ~ bernoulli_logit_glm(X, beta0, beta1);
  // y ~ bernoulli_logit(beta0 + X * beta1);
}"""


logistic_data_dict = Dict(
    "N" => N, "y" => convert(Vector{Int32},y), "X" => X,
    "mu0" => 0, "mu1" => 0, "sigma0" => 10, "sigma1" => 5
);

Now, let’s compile the models. I’m using only a single chain, because for models this fast the overhead of launching chains in parallel is high enough to make the timing less accurate:

In [24]:
stanmodel_logistic = Stanmodel(
    name = "logistic", Sample(num_samples=40000,num_warmup=900),
    model = bernoulli_logit, nchains = 1
)
stanmodel_logistic_glm = Stanmodel(
    name = "logistic_glm", Sample(num_samples=40000,num_warmup=900),
    model = bernoulli_logit_glm, nchains = 1
);
In [25]:
@time rc, chns, cnames = stan(stanmodel_logistic, logistic_data_dict, ProjDir);
Inference for Stan model: logistic_model
1 chains: each with iter=(40000); warmup=(0); thin=(1); 40000 iterations saved.

Warmup took (0.22) seconds, 0.22 seconds total
Sampling took (9.6) seconds, 9.6 seconds total

                 Mean     MCSE   StdDev     5%    50%    95%  N_Eff  N_Eff/s  R_hat
lp__             -354  1.2e-02  1.6e+00   -357   -353   -352  19398     2016    1.0
accept_stat__    0.92  4.0e-04  9.1e-02   0.73   0.95    1.0  51968     5400   1.00
stepsize__       0.59  4.0e-14  2.9e-14   0.59   0.59   0.59   0.50    0.052   1.00
treedepth__       2.6  2.5e-03  4.9e-01    2.0    3.0    3.0  38384     3989   1.00
n_leapfrog__      6.4  1.1e-02  2.1e+00    3.0    7.0    7.0  39020     4055   1.00
divergent__      0.00     -nan  0.0e+00   0.00   0.00   0.00   -nan     -nan   -nan
energy__          356  1.8e-02  2.3e+00    353    356    360  16347     1699    1.0
beta0           0.030  4.3e-04  9.4e-02  -0.12  0.030   0.19  47124     4897   1.00
beta1[1]         -1.6  6.8e-04  1.3e-01   -1.8   -1.6   -1.3  36572     3800   1.00
beta1[2]         -1.6  7.0e-04  1.3e-01   -1.8   -1.6   -1.4  36548     3798   1.00
beta1[3]        -0.34  4.7e-04  1.0e-01  -0.51  -0.34  -0.18  45501     4728    1.0
beta1[4]         0.61  4.5e-04  9.3e-02   0.46   0.61   0.76  42382     4404   1.00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

 15.250890 seconds (18.96 M allocations: 952.088 MiB, 2.26% gc time)
In [26]:
@time rc_glm, chns_glm, cnames_glm = stan(stanmodel_logistic_glm, logistic_data_dict, ProjDir);
Inference for Stan model: logistic_glm_model
1 chains: each with iter=(40000); warmup=(0); thin=(1); 40000 iterations saved.

Warmup took (0.16) seconds, 0.16 seconds total
Sampling took (7.2) seconds, 7.2 seconds total

                 Mean     MCSE   StdDev     5%    50%    95%  N_Eff  N_Eff/s  R_hat
lp__             -354  1.1e-02  1.6e+00   -357   -353   -352  20731     2895   1.00
accept_stat__    0.89  5.2e-04  1.2e-01   0.65   0.93    1.0  52813     7376   1.00
stepsize__       0.71  8.8e-14  6.2e-14   0.71   0.71   0.71   0.50    0.070   1.00
treedepth__       2.5  2.7e-03  5.3e-01    2.0    3.0    3.0  39000     5447    1.0
n_leapfrog__      6.3  1.5e-02  3.0e+00    3.0    7.0     15  38576     5387    1.0
divergent__      0.00     -nan  0.0e+00   0.00   0.00   0.00   -nan     -nan   -nan
energy__          356  1.7e-02  2.2e+00    353    356    360  17605     2459   1.00
beta0           0.030  4.6e-04  9.5e-02  -0.13  0.030   0.19  42192     5892   1.00
beta1[1]         -1.6  7.1e-04  1.3e-01   -1.8   -1.6   -1.4  33924     4738   1.00
beta1[2]         -1.6  7.1e-04  1.3e-01   -1.8   -1.6   -1.4  35587     4970   1.00
beta1[3]        -0.34  5.1e-04  1.0e-01  -0.51  -0.34  -0.17  38730     5409   1.00
beta1[4]         0.61  4.7e-04  9.3e-02   0.46   0.61   0.76  39628     5534   1.00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

  8.188168 seconds (4.01 M allocations: 168.608 MiB, 0.46% gc time)

These took about 9.82 and 7.36 seconds respectively, versus roughly 1.1 seconds for the Julia version, despite the fact that Stan also had a fast “vectorized” bernoulli_logit_glm.

Interestingly, Stan used larger stepsizes (0.59 and 0.71 vs 0.242). Tree depths were about the same on average.