Now, we look at an ITP model. This model takes longer to run, so we’ll want to fit multiple chains in parallel.
The easiest way to do that at the moment is with the standard library Distributed.
This library is for multiprocessing. We add child processes, and can then have them run code.
Unfortunately, this requires prefixing some code with @everywhere, so that the children are aware of the functions and data types we’re using.
Julia also supports multithreading, but threads are still experimental. In particular, IO from base Julia is not thread safe, so something as simple as printing from two threads at the same time is likely to freeze or crash Julia. I didn’t actually print in my example here, so I will likely try threading in a future version of this document.
using Distributed
addprocs(14); # Create 14 children
@everywhere begin # Load the following on master and all children
using DistributionParameters, ProbabilityDistributions
using LoopVectorization, DynamicHMC, LogDensityProblems, SLEEFPirates, SIMDPirates
using LinearAlgebra, StructuredMatrices, ScatteredArrays, PaddedMatrices
using ProbabilityModels
using DistributionParameters: LKJ_Correlation_Cholesky
using ProbabilityModels: Domains, HierarchicalCentering, ∂HierarchicalCentering, ITPExpectedValue, ∂ITPExpectedValue
BLAS.set_num_threads(1)
end # everywhere
# Define the model on master and all children.
@everywhere @model ITPModel begin
# Non-hierarchical Priors
(0.5ρ + 0.5) ~ Beta(2, 2)
κ ~ Gamma(0.1, 0.1) # μ = 1, σ² = 10
σ ~ Gamma(1.5, 0.25) # μ = 6, σ² = 24
θ ~ Normal(10)
L ~ LKJ(2.0)
# Hierarchical Priors.
# h subscript, for highest in the hierarchy.
μₕ₁ ~ Normal(10) # μ = 0
μₕ₂ ~ Normal(10) # μ = 0
σₕ ~ Normal(10) # μ = 0
# Raw μs; non-centered parameterization
μᵣ₁ ~ Normal() # μ = 0, σ = 1
μᵣ₂ ~ Normal() # μ = 0, σ = 1
# Center the μs
μᵦ₁ = HierarchicalCentering(μᵣ₁, μₕ₁, σₕ)
μᵦ₂ = HierarchicalCentering(μᵣ₂, μₕ₂, σₕ)
σᵦ ~ Normal(10) # μ = 0
# Raw βs; non-centered parameterization
βᵣ₁ ~ Normal()
βᵣ₂ ~ Normal()
# Center the βs.
β₁ = HierarchicalCentering(βᵣ₁, μᵦ₁, σᵦ, domains)
β₂ = HierarchicalCentering(βᵣ₂, μᵦ₂, σᵦ, domains)
U = inv′(Diagonal(σ) * L)
# Likelihood
μ₁ = ITPExpectedValue(t, β₁, κ, θ)
μ₂ = ITPExpectedValue(t, β₂, κ, θ)
AR = AutoregressiveMatrix(ρ, δₜ)
Y₁ ~ Normal(μ₁, AR, U)
Y₂ ~ Normal(μ₂, AR, U)
end
This model shows off the overloading of distributions. In particular, we have the Normal distribution with 0, 1, and 3 parameters, and HierarchicalCentering with 3 and 4 parameters.
For example, when given only one argument, the Normal distribution has mean 0, and the argument is interpreted as the standard deviation.
When given 3 arguments, the Normal distribution is a matrix normal. Additionally, many of these calls are vectorized, depending on whether or not the arguments are vectors.
Also neat is that we can place expressions on the left hand side of sampling statements. The use case here is that $\rho\in(-1,1)$, so we can scale it to lie in $(0,1)$ and assign a Beta$(2,2)$ prior.
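As a quick sanity check of that transform (using the Distributions.jl package purely for illustration; it is not what ProbabilityDistributions uses internally), note that the scaled value lands in the Beta distribution’s support and the transform’s Jacobian is constant:
using Distributions   # assumption: only for this illustration
ρ = -0.3                 # any value in (-1, 1)
z = 0.5ρ + 0.5           # lies in (0, 1), the support of Beta(2, 2)
logpdf(Beta(2, 2), z)    # the prior log-density contributed by ρ;
                         # the log-Jacobian of z = 0.5ρ + 0.5 is log(0.5), a constant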
HierarchicalCentering transforms from a non-centered to a centered parameterization. If given a fourth argument — domains — it will break up the centering along domains. That is, we can define the domains as:
domains = Domains(2,2,2,3)
And now we have
K, D = sum(domains), length(domains)
nine total endpoints distributed across 4 domains of sizes 2, 2, 2, and 3. HierarchicalCentering will divide the 4 values of means and standard deviations among the non-centered parameters accordingly: the first 2 will have the first mean and standard deviation, the next 2 the second, and so on.
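To make the indexing concrete, here is a plain-Julia sketch of that domain-wise centering (illustrating the math only, with a hypothetical centering_sketch function and a scalar σ as in the model; the real HierarchicalCentering is an optimized kernel with its own types):
# Sketch: map K non-centered values onto D domain means, domain by domain.
function centering_sketch(raw, μ, σ, domains)
    β = similar(raw)
    k = 0
    for (d, nd) ∈ enumerate(domains)   # e.g. domains = (2, 2, 2, 3)
        for _ ∈ 1:nd
            k += 1
            β[k] = μ[d] + σ * raw[k]   # endpoint k is centered on its domain's mean
        end
    end
    β
end
centering_sketch(randn(9), [-3.0, 0.0, 3.0, 6.0], 0.5, (2, 2, 2, 3))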
ITPExpectedValue calculates a $T \times K$ matrix of unscaled ITP values, where $T$ is the number of times, and $K$ is the number of endpoints and the length of $\beta, \kappa,$ and $\theta$. The elements of the matrix equal:
$f(t, \beta, \kappa, \theta) = \beta\left(1-e^{-\kappa t}\right) + \theta$.
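Written out directly in Julia, a naive transcription of that formula (not the optimized ITPExpectedValue kernel) would be:
# Entry (i, k) of the T × K matrix: β[k] * (1 - exp(-κ[k] * t[i])) + θ[k].
itp_naive(t, β, κ, θ) = [β[k] * (1 - exp(-κ[k] * t[i])) + θ[k]
                         for i ∈ eachindex(t), k ∈ eachindex(β)]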
Because these are all Julia functions, we are free to try them out interactively, or use them to help generate sample data with which to validate the model. Let’s do that:
κ = (1/32) * reduce(+, (@Constant randexp(K)) for i ∈ 1:8) # κ ~ Gamma(8, 32)
σd = sum(@Constant randexp(4)) / 16 # σd ~ Gamma(4,16)
θ = 2.0 * (@Constant randn(K)) # θ ~ Normal(0,2)
S = (@Constant randn(K,4K)) |> x -> x * x'
S *= (1/16)
pS = StructuredMatrices.SymmetricMatrixL(S)
L = PaddedMatrices.chol(S); U = PaddedMatrices.invchol(pS)
μₕ₁, μₕ₂ = -3.0, 9.0
μᵦ₁ = μₕ₁ + @Constant randn(D); # placebo
μᵦ₂ = μₕ₂ + @Constant randn(D); #treatment
β₁ = HierarchicalCentering((@Constant randn(K)), μᵦ₁, σd, domains); # placebo
β₂ = HierarchicalCentering((@Constant randn(K)), μᵦ₂, σd, domains); # treatment
# randexp generates Exponential(1) draws; we take the cumulative sum of the gaps for the times.
T = 24; δₜ = (1/16) * reduce(+, (@Constant randexp(T-1)) for i ∈ 1:8)
times = vcat(zero(ConstantFixedSizePaddedVector{1,Float64}), cumsum(δₜ));
μ₁ = ITPExpectedValue(times, β₁, κ, θ)
μ₂ = ITPExpectedValue(times, β₂, κ, θ)
You can see here how the mean values change over time (as we move down the rows), and that baselines differ across columns. $\mu_1$ tends to decrease over time, while $\mu_2$ increases.
Now, let’s generate the rest of the data, and specify all our unknowns.
ρ = 0.7
ARcorrelation = StructuredMatrices.AutoregressiveMatrix(ρ, δₜ)
ARcholesky = PaddedMatrices.chol(ConstantFixedSizePaddedMatrix(ARcorrelation))
# Create an Array of matrix-normal entries.
Y₁a = [ARcholesky * (@Constant randn(T, K)) * L' + μ₁ for n in 1:120]
Y₂a = [ARcholesky * (@Constant randn(T, K)) * L' + μ₂ for n in 1:120]
Y₁ = ChunkedArray(Y₁a) # Rearranges how the data is stored under the hood.
Y₂ = ChunkedArray(Y₂a) # This often allows for better vectorization.
ℓ_itp = ITPModel(
domains = domains, Y₁ = Y₁, Y₂ = Y₂, t = times, δₜ = δₜ,
L = LKJ_Correlation_Cholesky{K}, ρ = BoundedFloat{-1,1},
κ = PositiveVector{K}, θ = RealVector{K},
μₕ₁ = RealFloat, μₕ₂ = RealFloat,
μᵣ₁ = RealVector{D}, μᵣ₂ = RealVector{D},
βᵣ₁ = RealVector{K}, βᵣ₂ = RealVector{K},
σᵦ = PositiveFloat, σₕ = PositiveFloat,
σ = PositiveVector{K}
);
But before fitting this model, we’ll allow Stan to set the baseline. Our equivalent Stan model:
using CmdStan
ProjDir = "/home/chriselrod/Documents/progwork/julia/StanProjDir"
function itp_stan_model(; T = "T", K = "K", D = "D")
if T isa Number
T_def = ""
Tm1 = T - 1
Tm1_def = ""
else
T_def = "int T; // Number of times"
Tm1 = "Tm1"
Tm1_def = "int Tm1 = T-1;"
end
K_def = K isa Number ? "" : "int K; // Number end points"
D_def = D isa Number ? "" : "int D; // Number domains"
"""
data {
int N1; // Sample size for both categories
int N2; // Sample size for both categories
$T_def
$K_def
$D_def
int domains[$D];
// Dose 1
matrix[$K,$T] Y1[N1];
// Dose 2
matrix[$K,$T] Y2[N2];
row_vector[$T] times;
}
transformed data{
int N = N1 + N2;
$Tm1_def
int ind = 0;
row_vector[$Tm1] delta_times = times[2:$T] - times[1:$Tm1];
matrix[$K,$D] domain_map = rep_matrix(0, $K, $D);
for (d in 1:$D){
for (k in 1:domains[d]){
ind += 1;
domain_map[ind,d] = 1;
}
}
}
parameters {
real muh[2];
real<lower=-1,upper=1> rho;
vector<lower=0>[$K] kappa;
vector[$K] theta;
cholesky_factor_corr[$K] L;
vector[$D] muraw[2];
vector[$K] betaraw[2];
real<lower=0> sigma_beta;
real<lower=0> sigma_h;
vector<lower=0>[$K] sigma;
}
model {
real scaledrho = 0.5*rho + 0.5;
vector[$K] mu_beta[2];
vector[$K] beta[2];
row_vector[$Tm1] temp;
vector[$T] temp2;
row_vector[$Tm1] AR_diag;
row_vector[$Tm1] nAR_offdiag;
matrix[$K,$T] kappa_time;
matrix[$K,$T] delta;
matrix[$K,$T] SLdelta;
matrix[$K,$Tm1] SLdeltaAR;
matrix[$K,$T] itp_expected_value[2];
matrix[$K,$K] SL = diag_pre_multiply(sigma, L);
scaledrho ~ beta(2,2);
kappa ~ gamma(0.1,0.1);
sigma ~ gamma(1.5, 0.25);
theta ~ normal(0, 10);
L ~ lkj_corr_cholesky(2);
for (i in 1:2){
muh[i] ~ normal(0,10);
muraw[i] ~ normal(0,1);
betaraw[i] ~ normal(0,1);
mu_beta[i] = domain_map * (muh[i] + muraw[i] * sigma_h);
beta[i] = mu_beta[i] + betaraw[i] * sigma_beta;
}
sigma_h ~ normal(0, 10);
sigma_beta ~ normal(0, 10);
if (rho > 0){ // temp = - sign(rho) * abs(rho)^delta_times
temp = exp(delta_times * log( rho));
} else if (rho == 0) {
temp = rep_row_vector(0, $Tm1);
} else { // rho < 0
temp = - exp(delta_times * log(-rho));
}
AR_diag = inv_sqrt(1 - temp .* temp);
nAR_offdiag = AR_diag .* temp;
kappa_time = expm1(-1 * kappa * times);
for (i in 1:2){
itp_expected_value[i] = rep_matrix(theta, $T) - rep_matrix(beta[i], $T) .* kappa_time;
}
// covariance logdeterminant
// AR is the inverse of the cholesky factor of a correlation matrix => +
// SL is the cholesky factor of a covariance matrix => -
target += N*($K*sum(log(AR_diag)) - $T*log_determinant(SL));
for (n in 1:N1){
delta = Y1[n] - itp_expected_value[1];
SLdelta = mdivide_left_tri_low(SL, delta);
target += -0.5 * dot_self(SLdelta[:,1]);
SLdeltaAR = SLdelta[:,2:$T] .* rep_matrix(AR_diag, $K) - SLdelta[:,1:$Tm1] .* rep_matrix(nAR_offdiag, $K);
target += -0.5 * sum(columns_dot_self(SLdeltaAR));
}
for (n in 1:N2){
delta = Y2[n] - itp_expected_value[2];
SLdelta = mdivide_left_tri_low(SL, delta);
target += -0.5 * dot_self(SLdelta[:,1]);
SLdeltaAR = SLdelta[:,2:$T] .* rep_matrix(AR_diag, $K) - SLdelta[:,1:$Tm1] .* rep_matrix(nAR_offdiag, $K);
target += -0.5 * sum(columns_dot_self(SLdeltaAR));
}
}
"""
end
stan_itp_data_dict = Dict(
"N1" => length(Y₁),
"N2" => length(Y₂),
"T" => T,
"K" => K,
"D" => D,
"domains" => Array(domains),
"Y1" => permutedims(reshape(reinterpret(Float64, Y₁a),T,K,length(Y₁)),(3,2,1)),
"Y2" => permutedims(reshape(reinterpret(Float64, Y₂a),T,K,length(Y₂)),(3,2,1)),
"times" => times
)
stanmodel_itp = Stanmodel(
name = "ITP",
Sample(
num_samples=2000,num_warmup=900,
adapt=CmdStan.Adapt(delta=0.99)
), model = itp_stan_model(), nchains = 14
);
Running it:
@time rc_itp, chns_itp, cnames_itp = stan(stanmodel_itp, stan_itp_data_dict, ProjDir);
We see that it failed to converge. Chains 3, 10, 12, and 13 ran much faster than the other 10, which means that these two sets adapted differently. Naturally, we suspect that one of these adaptations was “correct”, causing convergence, and the other “incorrect”, resulting in convergence failure.
Before focusing on analysis, let’s perform a second run to see if we can speed things up by making many of these arrays statically sized.
“Statically sized” here means making Stan (and C++) aware of the sizes of the arrays while compiling, instead of only while running. This is one of the “tricks” I often use in my Julia code to get massive speed ups. It is also a very natural trick to exploit in Julia, because you’ll never personally have to recompile anything: it happens automatically whenever you use arrays of different sizes. Multiple dispatch also means we will not have to define any separate functions or models to make it work. It’s all automatic.
We have to do things a little more manually in Stan. This is why I wrote the Stan model above as a function that returns a string, to let me define array sizes as either variables, or as inserted numbers.
This also makes it clear what I mean by making this seamless in Julia: our Julia model definition looks a lot more like the first, with variable-sized arrays, but it compiles like the second, with awareness of the array sizes. As we will see from the difference in runtime here, while Julia benefits from this, Stan apparently does not.
To provide a concrete example of why knowing array bounds may help: without knowing the bounds of an array, the clang++ compiler will often vectorize a simple for loop of length $N$ by splitting it into two loops. The first of these loops is “vectorized”, calculating 32 iterations at a time in parallel (while still only using a single core). The second loop does a single iteration at a time, and catches the remainder of $N\div32$ that the first loop misses. This may be ideal when $N > 32$, but will actually slow things down when $N < 32$: the vectorized code doesn’t get used, and is just bloat that gets in the way. If the compiler knows that $N < 32$, it won’t create the first loop. Instead, it will decide to do something different, perhaps not even creating a loop at all, but “unrolling” the entire thing. As a brief example:
using PaddedMatrices
x = @Constant randn(16); y = @Constant randn(16);
function my_dot(x, y)
d = 0.0
@fastmath @inbounds for i ∈ eachindex(x,y)
d += x[i] * y[i]
end
d
end
@code_native my_dot(x, y)
What the assembly says here is: move (vmov) packed data (pd, i.e. vmovpd) into register zmm0 from the memory location %rsi, and into zmm1 from 64 bytes down from %rsi. Registers whose names start with zmm hold 64 bytes of data, i.e. 64 bytes / 8 bytes per double = 8 double precision numbers. This moves the entirety of one of the vectors into two CPU registers.
Then it multiplies packed data (vmulpd): zmm1 with the memory 64 bytes down from %rdi, the second half of the other vector, overwriting zmm1.
Then it does a fused multiply-add of packed data (vfmadd231). This multiplies the start of the memory at %rdi with register zmm0, adding the result to zmm1, all in one instruction. The contents of zmm1 are now:
\begin{align*}
\texttt{zmm1} &= \begin{bmatrix}
x_1 \times y_1 + x_{9} \times y_{9}\\
x_2 \times y_2 + x_{10} \times y_{10}\\
x_3 \times y_3 + x_{11} \times y_{11}\\
x_4 \times y_4 + x_{12} \times y_{12}\\
x_5 \times y_5 + x_{13} \times y_{13}\\
x_6 \times y_6 + x_{14} \times y_{14}\\
x_7 \times y_7 + x_{15} \times y_{15}\\
x_8 \times y_8 + x_{16} \times y_{16}\\
\end{bmatrix}.
\end{align*}
It then proceeds to add the second half of zmm1 to the first half, so that it now has 4 partial sums. Then it does this again, and again, until in the end it has:
\begin{align*}
\texttt{zmm0} &= \begin{bmatrix}
\sum_{i=1}^{16} x_i\times y_i\\
\text{junk}\\
\text{junk}\\
\text{junk}\\
\text{junk}\\
\text{junk}\\
\text{junk}\\
\text{junk}
\end{bmatrix}
\end{align*}
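In scalar Julia terms, that horizontal reduction looks something like the following (an illustration of the idea only; the actual code uses shuffle and add instructions):
lanes = rand(8)                    # stand-in for the 8 doubles held in zmm1
half4 = lanes[1:4] .+ lanes[5:8]   # fold 8 partial sums down to 4
half2 = half4[1:2] .+ half4[3:4]   # 4 down to 2
total = half2[1] + half2[2]        # 2 down to 1; total ≈ sum(lanes)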
Overall, this involves a lot less work for the CPU than the loop we actually wrote!
If we used an x and a y of different lengths, Julia would, under the hood, compile a different version of the function specific to those new sizes, because the types of x and y are defined by their sizes.
For as long as that Julia session keeps running, Julia will “remember” both versions and not have to compile them again.
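A minimal sketch of how a length can live in the type, so that each length gets its own compiled specialization (SizedVec and sized_dot are toy names of my own, not the PaddedMatrices internals):
# Encode the length N as a type parameter.
struct SizedVec{N}
    data::NTuple{N,Float64}
end
function sized_dot(x::SizedVec{N}, y::SizedVec{N}) where {N}
    d = 0.0
    @inbounds for i ∈ 1:N          # N is a compile-time constant inside this method
        d += x.data[i] * y.data[i]
    end
    d
end
a = SizedVec(ntuple(_ -> randn(), 16)); b = SizedVec(ntuple(_ -> randn(), 16));
sized_dot(a, b)   # compiles (and caches) a method specialized for N = 16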
I’m using g++ as my C++ compiler, but the same ideas ought to apply. Perhaps I will try clang++ next instead, but g++ can vectorize functions like exp and log, while clang cannot. CmdStan does helpfully emit .hpp files showing the C++ code the Stan programs get translated into, but I haven’t dug into them to try and understand what’s going on and how I might actually be able to make Stan faster.
I think it would be very hard to get Stan to be competitive in speed with my Julia code, however.
Without further delay:
stanmodel_itp2 = Stanmodel(
name = "StanITP",
Sample(
num_samples=2000,num_warmup=900,
adapt=CmdStan.Adapt(delta=0.99)
), model = itp_stan_model(T = T, K = K, D = D), nchains = 14
);
@time rc_itp2, chns_itp2, cnames_itp2 = stan(stanmodel_itp2, stan_itp_data_dict, ProjDir);
This was somehow slower. Weird.
Chains 1, 7, and 9 ran much faster than the others. Overall, convergence failed.
Thankfully, CmdStan makes it easy to analyze arbitrary sets of chains. We just need to call stansummary on the batches we want to analyze.
Let’s look at the fast chains of the first run:
fast1 = (2,4,8)
fast2 = (6,11)
# path to the stansummary executable
stansummary = "/home/chriselrod/Documents/languages/cmdstan/bin/stansummary"
# path to where the samples were saved.
resdir = "/home/chriselrod/Documents/progwork/julia/tmp"
run(`$stansummary $(resdir)/ITP_samples_$[i for i ∈ fast1].csv`)
We see they had an average treedepth of only 2.9, averaging only 12 leapfrog steps per iteration. This is why they were fast.
They also failed to converge, with abysmal effective sample sizes.
The slow chains:
run(`$stansummary $(resdir)/ITP_samples_$[i for i ∈ 1:14 if i ∉ fast1].csv`)
That’s better! These chains converged. Their treedepth averaged 8.9, with a mean of 510 leapfrog steps per iteration.
Runtime is roughly proportional to the number of leapfrog steps.
The other noteworthy thing about adaptation is the difference in stepsize: the slow chains that converged took steps nearly 1000 times larger!
Similarly, let’s look at the second run. First, the fast chains:
run(`$stansummary $(resdir)/StanITP_samples_$[i for i ∈ fast2].csv`)
We have an average of 12 leapfrog steps per iteration and a mean treedepth of 2.8. Also, total convergence failure.
The slow chains:
run(`$stansummary $(resdir)/StanITP_samples_$[i for i ∈ 1:14 if i ∉ fast2].csv`)
These did not converge. Yikes. We see there is another outlier that was slower than the rest. Let’s exclude it too:
bad2 = (fast2...,14)
run(`$stansummary $(resdir)/StanITP_samples_$[i for i ∈ 1:14 if i ∉ bad2].csv`)
There we go. These remaining 10 chains did converge, giving us reasonable numbers of effective samples. Before moving on to Julia, let’s take a look at the slow chain that failed to converge.
run(`$stansummary $(resdir)/StanITP_samples_14.csv`)
We see it had an average treedepth of 10, reaching a steady 1023 leapfrog steps per iteration.
Also, note the stepsize: $8\times 10^{-14}$, versus the $9.8\times 10^{-3}$ of the chains that did converge. That’s a difference of more than $10^{11}$.
Despite maxing out the treedepth / number of leapfrogs and taking such small steps, the acceptance rate was also worse (0.95) than in the chains that did converge (0.99).
Given that the chains which fail to converge all share this feature (a step size that is far too small), it may be nice to be able to add a minimum acceptable step size. I could look into modifying the Julia code to allow enforcing that.
I am not sure that will be enough, though. The energy matrix probably gets badly misspecified in the same process that leads to the tiny stepsizes, so enforcing a sensible step size would only help if the adaptation could then recover.
The stepsize decreases during adaptation when the acceptance rate is lower than the targeted value. So, the question is, why was the acceptance rate so small? Why did it keep rejecting?
I have tried decreasing adapt_delta (the targeted acceptance rate) in Stan, but this seemed to result in more chains failing to converge. It unfortunately takes a long time to run a large enough sample of chains to draw accurate conclusions through the noise.
Choosing low values of adapt_delta, however, appears to work well in Julia’s DynamicHMC.jl. DynamicHMC’s adaptation works differently than Stan’s by default: Stan uses a diagonal energy matrix, while DynamicHMC’s is dense, and Stan adapts for 200 iterations while DynamicHMC adapts for 900. A dense matrix has $\frac{P(P-1)}{2}$ more parameters than a diagonal matrix, where $P$ is the total number of parameters in the original model. For reference, $P=94$ for this model when $D=4, K=9$, as in the example here.
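To put a number on that difference for this model:
# How many entries each kind of energy (mass) matrix asks the adaptation to estimate.
P = 94                               # parameters in this model (D = 4, K = 9)
diagonal_entries = P                 # one scale per parameter
dense_entries    = P * (P + 1) ÷ 2   # a symmetric P × P matrix
dense_entries - diagonal_entries     # == P * (P - 1) ÷ 2 == 4371 extra entries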
Compiling then running 14 chains in Julia:
@time chains1, tuned_samplers1 = NUTS_init_tune_distributed(ℓ_itp, 2000, δ = 0.75, report = DynamicHMC.ReportSilent());
@time chains2, tuned_samplers2 = NUTS_init_tune_distributed(ℓ_itp, 2000, δ = 0.75, report = DynamicHMC.ReportSilent());
I am rather certain Julia’s allocation tracker doesn’t see what the children are doing, and that the run in fact used oodles (terabytes) of memory. Obviously not all at the same time, but it means the garbage collector was rather busy, and likely ate a significant chunk of this runtime.
Either way, 75 seconds to run and compile, and then 40 seconds to run again! Pretty good. Especially compared to Stan, where many chains took 6000+ seconds to run.
The chains we get back are raw, unconstrained parameters. That is, they’re simply a chain of vectors of length 94.
In the logistic regression example, they were vectors of length 5, and it was easy to associate the first with $\beta_0$ and the remaining 4 with $\beta_1$.
Here, we have many more parameters. It would be cumbersome to connect individual indices of our vectors with the actual parameters they correspond to.
Many also underwent constraining transformations, for example $\textbf{L}$, the Cholesky factor of a correlation matrix.
Thankfully, the function constrain will constrain our parameter vectors, producing the actual (constrained) parameters of our model:
chains = vcat(chains1, chains2)
tuned_samplers = vcat(tuned_samplers1, tuned_samplers2)
itp_samples = [constrain.(Ref(ℓ_itp), get_position.(chain)) for chain ∈ chains];
These constrained samples are named tuples. For example, the last sample from the 7th chain is:
itp_samples[7][end]
Glancing at it, we see, for example, that $\mu_{h2}\approx 8.94$ for this sample.
If we wanted to extract a specific parameter, say $\textbf{L}$, the Cholesky factor of a correlation matrix, all we must do is:
L_7_end = itp_samples[7][end].L # L from the last sample of the 7th chain
We can confirm that this is in fact the Cholesky factor of a correlation matrix:
L_7_end * L_7_end'
Note that the output type is Array{Float64,2}. That is a standard, unsized, heap-allocated Julia array type. These are efficient at large sizes, but not at small ones. It was produced by a generic fallback method for matrix multiplication: a sign that I haven’t yet implemented a method that autospecializes for these LKJ_Correlation_Cholesky types.
Let’s focus our analysis on the parameters $\mu_{h1}, \mu_{h2},$ and $\rho$. We’ll look at the effective sample sizes of these parameters on each of the chains.
using MCMCDiagnostics
μₕ₁_chains = [[s.μₕ₁ for s ∈ sample] for sample ∈ itp_samples]
μₕ₂_chains = [[s.μₕ₂ for s ∈ sample] for sample ∈ itp_samples]
ρ_chains = [[s.ρ for s ∈ sample] for sample ∈ itp_samples]
poi_chains = (μₕ₁_chains, μₕ₂_chains, ρ_chains)
ess = [effective_sample_size(s[i]) for i ∈ eachindex(itp_samples), s ∈ poi_chains]
ess[1:14,:]
ess[15:end,:]
We see that 1 chain completely failed to converge, while another two had rather low effective sample sizes. I’ll filter out all three of these chains.
Let’s look at the NUTS statistics of those chains that failed to converge.
converged = vec(sum(ess, dims = 2)) .> 1000
not_converged = .! converged
NUTS_statistics.(chains[not_converged])
For the chain that completely failed, we see very shallow trees and poor acceptance rates, like with Stan.
Let’s look at the corresponding stats of the first three chains that did converge:
NUTS_statistics.((chains[converged])[1:3])
Remember that the third of these produced much lower effective sample sizes for $\mu_{h_1}$ and $\mu_{h_2}$ than the first two did. Its acceptance rate looks good, though, and we don’t see an obvious difference in mean treedepth versus chain 2, but its depth was more variable than that of chains 1 or 2.
Chain 1, which had the best effective sample sizes, had a tree depth of just 7 99% of the time.
Let’s also look at the tuned samplers:
tuned_samplers[not_converged]
tuned_samplers[converged][1:3]
The chain that completely failed had a very small stepsize, while the two that did relatively poorly had smaller step sizes than those that did converge.
Let’s look at the total effective sample sizes of the chains that did converge:
poi_chain = [vcat((chains[converged])...) for chains ∈ poi_chains]
μₕ₁_chain, μₕ₂_chain, ρ_chain = poi_chain
(
μₕ₁_ess = effective_sample_size(μₕ₁_chain),
μₕ₂_ess = effective_sample_size(μₕ₂_chain),
ρ_ess = effective_sample_size(ρ_chain)
)
μₕ₁_chains, μₕ₂_chains, ρ_chains = (chains[converged] for chains ∈ poi_chains)
(
μₕ₁_r̂ = potential_scale_reduction(μₕ₁_chains...),
μₕ₂_r̂ = potential_scale_reduction(μₕ₂_chains...),
ρ_r̂ = potential_scale_reduction(ρ_chains...)
)
Our effective sample sizes and $\hat{r}$s look good.
Finally, let’s look at the major quantiles for these parameters, and compare to the true values:
using Statistics
major_quantiles = [0.05 0.25 0.5 0.75 0.95]
true_values = (μₕ₁, μₕ₂, ρ)
for i ∈ eachindex(poi_chain)
display("Major Quantiles for paramter with true values: $(true_values[i]):")
display(vcat(major_quantiles, quantile(poi_chain[i], major_quantiles)))
end
I chose $0.05$ and $0.95$ as the extreme bounds in place of the usual $0.025$ and $0.975$ to match stansummary.