BYOGProbColInto

I will rewrite the JAGS model in Julia. The reason for this is that we have a lot of little arithmetic operations, which a compiled language (like Julia or C++) can make run much more quickly.

For now, I will only use code in base Julia and registered libraries that you can install simply via entering the Pkg mode (via pressing ] from the Julia REPL) and typing add PACKAGENAME.

This code will be written quickly / naively, without paying attention to performance, other than making sure the code is “type stable”.

In another post, I will discuss how to optimize the code, and see just how much more performance we can squeeze out of it. The optimization will rely on a host of libraries I’ve written that provide optimized versions of various functions, or tools that make optimization easier.

First, simply processing the inputs as we did in R to generate $\textbf{Y}$:

In [1]:
using StaticArrays, LinearAlgebra

function process_inputs!(Y, Data)
@inbounds for i ∈ 1:size(Data,1)
S = Symmetric(SMatrix{3,3}(
Data[i,5], Data[i,6], Data[i,8],
Data[i,6], Data[i,7], Data[i,9],
Data[i,8], Data[i,9], Data[i,10]
))

x = SVector(Data[i,1], Data[i,2], Data[i,3])

Y[i,:] .= inv(cholesky(S).L.data) * x
end
end

Out[1]:
process_inputs! (generic function with 1 method)

Earlier, I said we can break our sampling into three blocks:

We can split up our parameters into three blocks:

1. The (Inverse-)Wisharts $(\boldsymbol{\Sigma})\ \boldsymbol{\Lambda}$. Conditioning on the group membership of each $\textbf{y}_n$, the (Inverse-)Wisharts are conjugate with the normal distribution on $\textbf{y}_n$.
2. The proportion beloning to each group, $\boldsymbol{pi}$. Conditioning on the number of observations in each group, the Dirichlet-categorical distribution is conjugate.
3. Group memberships. Conditioning on the proportions and inverse wisharts of each group, we can calculate the individual likelihood of each $\textbf{y}$ belonging in each group. Categorical distributions are straightforward to sample from.

So, let us start with 1.. The posterior in a Normal-Inverse-Wishart model. Given:
\begin{align*}
\textbf{y}_i &\sim \mathcal{N}\left( \boldsymbol{\mu}, \boldsymbol{\Sigma} \right)\\
\boldsymbol{\mu} &\sim \mathcal{N}\left( \boldsymbol{\mu}_0, \frac{1}{\lambda} \boldsymbol{\Sigma} \right)\\
\boldsymbol{\Sigma} &\sim \text{Inverse-Wishart}\left(\boldsymbol{\Psi},\nu \right)
\end{align*}

Setting $\boldsymbol{\mu}$ to $\textbf{0}$, we are essentially taking the limit as $\lambda\rightarrow \infty$.

Then the posteriors for $\boldsymbol{\mu}$ and $\lambda$ are still $\textbf{0}$ and $\infty$, respectively. The remaining posteriors:
\begin{align*}
\nu_1 &= \nu_0 + n\\
\end{align*}
\begin{align*}
\boldsymbol{\Psi}_1 &= \boldsymbol{\Psi}_0 + \sum_{n=1}^N
\left(\textbf{y}_i-\bar{\textbf{y}}\right)
\left(\textbf{y}_i-\bar{\textbf{y}}\right)’ +
\underset{\lambda\rightarrow\infty}{\text{limit}}
\frac{\lambda n}{\lambda + n}
\left( \bar{\textbf{y}} – \boldsymbol{\mu}_0 \right)
\left( \bar{\textbf{y}} – \boldsymbol{\mu}_0 \right)’
\\
&= \boldsymbol{\Psi}_0 + \sum_{n=1}^N\left(\textbf{y}_i-\bar{\textbf{y}}\right)\left(\textbf{y}_i-\bar{\textbf{y}}\right)’ +
n\left( \bar{\textbf{y}} \right)
\left( \bar{\textbf{y}} \right)’
\\
&= \boldsymbol{\Psi}_0 +
\sum_{n=1}^N
\left(
\textbf{y}_i\textbf{y}_i’ -\textbf{y}_i\bar{\textbf{y}}’ – \bar{\textbf{y}}\textbf{y}_i’ + \bar{\textbf{y}}\bar{\textbf{y}}’
\right) +
n\bar{\textbf{y}}
\bar{\textbf{y}}’
\\
&= \boldsymbol{\Psi}_0 + \sum_{n=1}^N
\textbf{y}_i\textbf{y}_i’ –
2n\bar{\textbf{y}}
\bar{\textbf{y}}’ +
2n\bar{\textbf{y}}
\bar{\textbf{y}}’
\\
&= \boldsymbol{\Psi}_0 +
\sum_{n=1}^N
\textbf{y}_i\textbf{y}_i’
\end{align*}

Therefore, we can precompute $\textbf{y}_i\textbf{y}_i’$. Then, for each of the groups, we will sum up the members of those groups.

In [2]:
function fill_rank_one_covariances!(rank_one_covs, Y)
@inbounds for i ∈ eachindex(rank_one_covs)
y = SVector(Y[i,1],Y[i,2],Y[i,3])
rank_one_covs[i] = y * y'
end
end

Out[2]:
fill_rank_one_covariances! (generic function with 1 method)

Now, we’ll define a vector to indicate groups, another to hold the InverseWisharts.
We’ll also calculate how many were in each group.

In [3]:
function set_to_identity!(Ψs)
for i ∈ eachindex(Ψs)
Ψs[i] = eltype(Ψs)(I)
end
end

function calc_Ψs!(Ψs, group_counts, rank_one_covs, groups)
set_to_identity!(Ψs)
group_counts .= 0
@inbounds for i ∈ eachindex(rank_one_covs, groups)
Ψs[groups[i]] += rank_one_covs[i]
group_counts[groups[i]] += 1
end
end

Out[3]:
calc_Ψs! (generic function with 1 method)

Ψs is a vector containing the InverseWishart matrices we’re summing up. We look up the $i$th group, access that element of the vector, and add the $i$th rank-1 covariance matrix to it. We also add 1 to the corresponding group count.

Moving onto number 2, we to need to sample group proportions. The posterior is a Dirichlet, with posterior parameters $\alpha$ equal to the prior plus group counts (same as the Beta-Binomial posterior).

For $\boldsymbol{\alpha}_0$, we will set the $g$th element to $1.5^{G-g}$, where $G$ is the number of groups.

In [4]:
using Distributions

function sample_group_proportions!(α₀, group_counts)
G = length(α₀)
α₀ .= group_counts .+ 1.5 .^ ((G-1):-1:0)
rand(Dirichlet(α₀))
end

WARNING: using Distributions.mode in module Main conflicts with an existing identifier.

Out[4]:
sample_group_proportions! (generic function with 1 method)

Finally, we need to sample group memberships. Here, rather than sampling Inverse Wisharts from $\boldsymbol{\Psi}$ and evaluating the normal likelihood, we will marginalize out the Inverse-Wishart, getting a multivariate $t$ distribution, and calculate the likelihood.

Then, the probability of the $n$th observatio belonging in the $g$th group, conditioning on the multivariate $t$ distributions and proportion belonging to each group is:
\begin{align*}
\frac{\pi_g t_g( y_n )}{
\sum_{k=1}^G \pi_k t_k( y_n )
},
\end{align*}
where $\pi_g$ and $t_g$ are the $g$th group proportion and multivariate $t$ distributions, respectively.

We’re letting $\nu_0$, the Inverse-Wishart prior degrees of freedom, be 5. This is so
The multivariate $t$ will then have $\nu_1 – 2 = N + 3$ degrees of freedom.

In [5]:
using SpecialFunctions

function sample_groups!(groups, Rs, Ls, indiv_π, baseline_π, πs, Y, Ψs, group_counts)
P = 3 # dimensions of multivariate t
for g ∈ eachindex(Ψs)
Ls[g] = cholesky(Symmetric(Ψs[g])).L
Rs[g] = LowerTriangular(inv(Ls[g].data))
ν = group_counts[g] + P
# part of t logpdf not a function of Y
baseline_π[g] = logdet(Rs[g]) + lgamma((ν + P)/2) - lgamma(ν/2) - (P/2) * log(ν)
# add log of group proportion
baseline_π[g] += log(πs[g])
end
@inbounds for n ∈ eachindex(groups)
y =  SVector(Y[n,1], Y[n,2], Y[n,3])
for g ∈ eachindex(Rs)
ν = group_counts[g] + P
Ry = Rs[g] * y
indiv_π[g] = exp(baseline_π[g] -(ν + P)/2 * log(1 + Ry' * Ry))
end
indiv_π ./= sum(indiv_π)
groups[n] = rand(Categorical(indiv_π))
end
end

Out[5]:
sample_groups! (generic function with 1 method)

Now, to run the Gibbs sampler, we just need to just our data (generating the rank-1 covariance matrices, for example), and then repeatedly call these three functions.

We want to monitor the variables Rs, Ls, πs, and group_counts.
To we’ll pass in $G\times M$ matrices for each, where $M$ is the desired number of MCMC iterations.

In [17]:
function covariance_realism_gibbs!(
Rs, Ls, πs, group_counts, # monitored variables
rank_one_covs, Y,
groups, α, indiv_π, baseline_π, Ψs, # other variables that get overridden
BPP, warmup # data and mcmc args
)

num_groups, mcmc_iter = size(Rs)

# Calculate Y
process_inputs!(Y, BPP)
# construct rank 1 covariances
fill_rank_one_covariances!(rank_one_covs, Y)

# initialize groups
baseline_π .= 1.5 .^ ((num_groups-1):-1:0)
baseline_π ./= sum(baseline_π)
base_cat = Categorical(baseline_π)
@inbounds for n ∈ eachindex(groups)
groups[n] = rand(base_cat)
end

# warmup
@inbounds for i ∈ 1:warmup
@views calc_Ψs!(Ψs, group_counts[:,1], rank_one_covs, groups)
@views πs[:,1] .= sample_group_proportions!(α, group_counts[:,1])
@views sample_groups!(groups, Rs[:,1], Ls[:,1], indiv_π, baseline_π, πs[:,1], Y, Ψs, group_counts[:,1])
end
# sample
@inbounds for i ∈ 2:mcmc_iter
@views calc_Ψs!(Ψs, group_counts[:,i], rank_one_covs, groups)
@views πs[:,i] = sample_group_proportions!(α, group_counts[:,i])
@views sample_groups!(groups, Rs[:,i], Ls[:,i], indiv_π, baseline_π, πs[:,i], Y, Ψs, group_counts[:,i])
end
end

Out[17]:
covariance_realism_gibbs! (generic function with 1 method)

Now, let’s try running the sampler.

In [7]:
using DelimitedFiles
BigPropPoints = readdlm(joinpath("/home/chriselrod/Documents/progwork/R", "BigPropPoints.csv"), ',', Float64)
BPP32 = Float32.(BigPropPoints')

num_groups = 6
N = size(BPP32, 1)

warmup = 4000
mcmc_iter = 10^4

Y = Matrix{Float32}(undef, N, 3)

const SMT = SMatrix{3,3,Float32,9}

Rs = Matrix{LowerTriangular{Float32,SMT}}(undef, num_groups, mcmc_iter)
Ls = Matrix{LowerTriangular{Float32,SMT}}(undef, num_groups, mcmc_iter)
πs = Matrix{Float32}(undef, num_groups, mcmc_iter)
group_counts = Matrix{Int}(undef, num_groups, mcmc_iter)

rank_one_covs = Vector{SMT}(undef, N)

groups = Vector{Int}(undef, N)
α₀ = Vector{Float32}(undef, num_groups)
indiv_π = similar(α₀)
baseline_π = similar(α₀)
Ψs = Vector{SMT}(undef, num_groups)

Out[7]:
6-element Array{SArray{Tuple{3,3},Float32,2,9},1}:
[1.4013e-45 0.0 3.78351e-44; 0.0 4.2039e-45 0.0; 2.8026e-45 0.0 5.60519e-45]
[0.0 7.00649e-45 0.0; 3.92364e-44 0.0 9.80909e-45; 0.0 8.40779e-45 0.0]
[1.12104e-44 0.0 1.4013e-44; 0.0 1.26117e-44 0.0; 2.8026e-44 0.0 1.54143e-44]
[0.0 1.82169e-44 0.0; 1.68156e-44 0.0 2.10195e-44; 0.0 1.96182e-44 0.0]
[2.24208e-44 0.0 2.66247e-44; 0.0 2.52234e-44 0.0; 2.38221e-44 0.0 2.94273e-44]
[0.0 3.22299e-44 0.0; 3.08286e-44 0.0 3.50325e-44; 0.0 3.36312e-44 0.0]        
In [18]:
@time covariance_realism_gibbs!(
Rs, Ls, πs, group_counts, # monitored variables
rank_one_covs, Y,
groups, α₀, indiv_π, baseline_π, Ψs, # other variables that get overridden
BPP32, warmup # data and mcmc args
)

  2.896678 seconds (14.87 M allocations: 458.801 MiB, 3.75% gc time)


The first time we run a function, it compiles. That is important if we are running a model once.

“Compiling” refers to the process of turning our text into binary code the processor runs. In Julia, the first time a function is called for a given combination of argument types, a corresponding version of that function is compiled. Then, future calls with that function-type combination will call that compiled version.
Compilers can work a lot of magic on written code to make that code run faster while giving the same (or, optionally, approximately the same) answer.
Languages like R and Python are not compiled, but run through an interpreter. This means code is constantly “interpreted” while it is being run, which adds time while running, and also that no optimizations to what you’ve written are taking place.

Note that if a function itself doesn’t need many computations to run, it could spend more time compiling and optimizing than it takes to run once. But, if we want to run the function many times, or if the function takes a long time to run, that cost of compiling can be amortized. Who cares about 2 seconds of compiling when it takes 2 minutes to run? Or 2 hours?

The library BenchmarkTools will run a function multiple times, and not report the time spent compiling. This can give us an idea of how long it would take to run a simulation.

In [19]:
using BenchmarkTools
@benchmark covariance_realism_gibbs!(
$Rs,$Ls, $πs,$group_counts, # monitored variables
$rank_one_covs,$Y,
$groups,$α₀, $indiv_π,$baseline_π, $Ψs, # other variables that get overridden$BPP32, \$warmup # data and mcmc args
)

Out[19]:
BenchmarkTools.Trial:
memory estimate:  445.57 MiB
allocs estimate:  14577906
--------------
minimum time:     2.816 s (3.27% GC)
median time:      2.817 s (3.29% GC)
mean time:        2.817 s (3.29% GC)
maximum time:     2.818 s (3.30% GC)
--------------
samples:          2
evals/sample:     1

JAGS took about 138 seconds, so this is about 50 times faster!
Really good for we didn’t try to make fast.

Lets analyze the posterior predictive distribution.

To do that, we’ll use a weighted kernel density estimate of the Mahalanobis distances. Most of the KDE functionality is from the library KernelDensity, but I wrap it to provide a few extra convenience features.

In [10]:
using Interpolations, KernelDensity, Statistics, StatsBase, Gadfly

struct KDE{T,ITP <: ScaledInterpolation} <: ContinuousUnivariateDistribution
x::StepRangeLen{T,Base.TwicePrecision{T},Base.TwicePrecision{T}}
density::Vector{T}
cumulative_density::Vector{T}
pdf::ITP
cdf::ITP
end
function KDE(kde::UnivariateKDE)
x = kde.x
density = kde.density
cumulative_density = cumsum(density)
cumulative_density ./= cumulative_density[end]

KDE(x, density, cumulative_density, pdf, cdf)
end
KDE(distances::AbstractVector) = KDE(kde(distances))
KDE(distances::AbstractVector, weights::AbstractVector) = KDE(kde(distances, weights = weights))

# Parameters are summarized by the full (x, density) set
StatsBase.params(kde::KDE) = (kde.x, kde.density)
Statistics.mean(kde::KDE{T}) where T = kde.x' * kde.density * T(kde.x.step)
function Statistics.var(kde::KDE{T}) where T
μ = zero(T)
σ² = zero(T)
x = kde.x; density = kde.density
@inbounds @simd for i ∈ eachindex(x, density)
xd = x[i] * density[i]
μ += xd
σ² += x[i] * xd
end
σ² * T(kde.x.step) - (μ * T(kde.x.step))^2
end
Statistics.std(kde::KDE) = sqrt(var(kde))
function Statistics.median(kde::KDE)
kde.x[findfirst(p -> p > 0.5, kde.cumulative_density)]
end
function StatsBase.mode(kde::KDE)
kde.x[argmax(kde.density)]
end
function StatsBase.entropy(kde::KDE{T}) where T
out = zero(T)
density = kde.density
@inbounds for i ∈ eachindex(density)
out -= density[i] * log(density[i])
end
out * T(kde.x.step)
end
function Distributions.pdf(kde::KDE, x::Real)
if (x < minimum(kde.x)) || (x > maximum(kde.x))
p = zero(T)
else
p = kde.pdf(x)
end
end
Distributions.logpdf(kde::KDE, x::Real) = log(kde.pdf(x))
function Distributions.cdf(kde::KDE{T}, x::Real) where T
if x < minimum(kde.x)
p = zero(T)
elseif x > maximum(kde.x)
p = one(T)
else
p = kde.cdf(x)
end
p
end

layer(x = kde.x, y = kde.density, Geom.line, args...)
end

WARNING: using StatsBase.mode in module Main conflicts with an existing identifier.


Sampling distances from the posterior:

In [11]:
function sample_distances!(distances::AbstractVector{T}, Ls, group_counts) where T
P = 3
@inbounds for i ∈ eachindex(distances)
z = @SVector randn(T, 3)
y = Ls[i] * z
u = T(rand(Chisq(group_counts[i] + P)))
distances[i] = sqrt( y' * y / u )
end
end
function sample_distances(Ls::AbstractArray{LowerTriangular{T,SMatrix{3,3,T,9}}}, group_counts) where {T}
distances = Vector{T}(undef, length(Ls))
sample_distances!(distances, Ls, group_counts)
distances
end

Out[11]:
sample_distances (generic function with 1 method)

Now we can plot our posterior predictive, to compare it with the samples:

In [12]:
dist_kde = KDE(sample_distances(Ls, group_counts), vec(πs))
mahalanobis_distances = sqrt.(vec(sum(Y .* Y, dims = 2)))

plot(
layer(dist_kde),
layer(x = mahalanobis_distances, Geom.density, Theme(default_color="orange")),
Coord.cartesian(xmin=0, xmax=10)
)

Out[12]:

Note that some of the differences come from oversmoothing of the observed mahalanobis distances: the density goes below zero, even though all the distances are positive.

Or we can test if the observed data seems to match the KDE posterior approximation via the Anderson Darling test:

In [23]:
using HypothesisTests

Out[23]:
One sample Anderson-Darling test
--------------------------------
Population details:
parameter of interest:   not implemented yet
value under h_0:         NaN
point estimate:          NaN

Test summary:
outcome with 95% confidence: fail to reject h_0
one-sided p-value:           0.4162

Details:
number of observations:   1024
sample mean:              3.6904406547546387
sample SD:                4.148375034332275
A² statistic:             0.8968557160468293


Of course, to really validate the model (and our implementation), it would be better to run simulations and/or employ techiques like cross validation.

For good measure, we can also compare the sample mean and standard deviation to those of our KDE:

In [24]:
mean(dist_kde), std(dist_kde)

Out[24]:
(3.6171568270583685, 4.178032962216361)