GibbsSamplingAutocorrelation








Now, lets talk about a practical reason why we may want to write our own Gibbs sampler.

The number one reason is that you can get much better sampling if you can come up with clever ways to group your parameters. In the last post, we saw that JAGS trounced our own Gibbs sampler in effectve sample size (and effective sample size / second) by recognizing it could block $y\sim\text{Binomial}(n,\theta)$ and $\theta\sim\text{Beta}(\alpha,\beta)$ together. Instead of putting them in separate blocks like we did.

Lets look at another example with an easy closed form solution:

\begin{align*}
\begin{bmatrix}w\\x\\y\\z\end{bmatrix}
&\sim\mathcal{N}\left(
\textbf{0},
\begin{bmatrix}
1 & 0.999999 & 0.999999 & 0.999999\\
0.999999 & 1 & 0.999999 & 0.999999\\
0.999999 & 0.999999 & 1 & 0.999999\\
0.999999 & 0.999999 & 0.999999 & 1
\end{bmatrix}
\right)
\end{align*}

The bad way to sample from this distribution would be to divide $w$, $x$, $y$, and $z$ into separate blocks, sampling them conditioned on the others. Taking advantage of the multivariate normality, we have

\begin{align*}
w | x, y, z &\sim \mathcal{N}\left(\tiny{
\begin{bmatrix}0.999999 & 0.999999 & 0.999999\end{bmatrix}
\begin{bmatrix}1 & 0.999999 & 0.999999\\0.999999 & 1 & 0.999999\\0.999999 & 0.999999 & 1\end{bmatrix}^{-1}\begin{bmatrix}x\\y\\z\end{bmatrix},
1 –
\begin{bmatrix}0.999999 & 0.999999 & 0.999999\end{bmatrix}
\begin{bmatrix}1 & 0.999999 & 0.999999\\0.999999 & 1 & 0.999999\\0.999999 & 0.999999 & 1\end{bmatrix}^{-1}
\begin{bmatrix}0.999999 \\ 0.999999 \\ 0.999999\end{bmatrix}}
\right)\\
w | x, y, z &\sim \mathcal{N}\left(
0.333x + 0.333y + 0.333z,
1.333\times 10^{-6}
\right)
\end{align*}

Yikes!!! The marginal variance of $w$ is 1, but the conditional variance is about $1.333\times 10^{-6}$.
The conditionals of $x$, $y$, and $z$ look similar. For good measure, lets write the sampler for the pleasure of basking in the resulting autocorrelation.

In [1]:
set.seed(123)
autocorrelation_gibbs <- function(
    mu = rep(0,4), P = length(mu),
    Sigma = matrix(rep(0.999999,P^2),ncol=P) + diag(rep(0.000001, P)),
    iter = 1e4, warmup = 1e3, inits = rnorm(P)
){
    conditional_sd <- sapply(1:P, function(p) sqrt(Sigma[p,p] - Sigma[p,-p] %*% solve(Sigma[-p,-p]) %*% Sigma[-p,p] ))
    mean_factors <- lapply(1:P, function(p) Sigma[p,-p] %*% solve(Sigma[-p,-p]) )
    samples <- matrix(double(P * iter), nrow = P)
    samples[,1] <- inits
    for (i in 1:warmup){
        for (p in 1:P){
            othervals <- samples[-p,1]
            samples[p,1] <- rnorm(1,
                                  mean = mu[p] + mean_factors[[p]] %*% (othervals - mu[-p]),
                                  sd = conditional_sd[p]
                                 )
        }
    }
    for (i in 2:iter){
        for (p in 1:P){
            othervals <- ifelse(1:(P-1) < p, samples[1:(P-1),i], samples[2:P,i-1])
            samples[p,i] <- rnorm(1,
                                  mean = mu[p] + mean_factors[[p]] %*% (othervals - mu[-p]),
                                  sd = conditional_sd[p]
                                 )
            
        }
            
    }
    samples                        
}

system.time(samples_autocor <- autocorrelation_gibbs())
   user  system elapsed 
  0.328   0.008   0.337 
In [2]:
effective_sample_size <- function(x){
    auto_corr <- acf(x, plot = FALSE, lag.max=length(x) - 1)$acf
    auto_corr_sum <- sum(auto_corr[2:(which.max(auto_corr < 0))])
    length(x) / (2*auto_corr_sum + 1)
}
w <- samples_autocor[1,]
x <- samples_autocor[2,]
y <- samples_autocor[3,]
z <- samples_autocor[4,]

c(
    SS = length(w),
    ESS_w = effective_sample_size(w),
    ESS_x = effective_sample_size(w),
    ESS_y = effective_sample_size(y),
    ESS_z = effective_sample_size(z)
)
SS
10000
ESS_w
2.94067612806035
ESS_x
2.94067612806035
ESS_y
2.94083125877033
ESS_z
2.94092512745236
In [3]:
acf(w)

That is bad. With a conditional standard deviation of about $0.001$ compared to the marginal standard deviation of 1, it takes a very long time for the samples to to walk the posterior. To demonstrate this, let’s start with bad initial values

In [4]:
samples_autocor <- autocorrelation_gibbs(inits = rep(10,4), warmup = 0)
apply(samples_autocor, 1, mean)
  1. 10.0864810840216
  2. 10.086475430207
  3. 10.086473486052
  4. 10.0864902269414

They barely managed to drift away from the initial values at all!

How can we do better?
We could instead sample directly from the multivariate distribution.

In [5]:
library(tidyverse)
rmultinorm <- function(
    mu = rep(0,4), P = length(mu),
    Sigma = matrix(rep(0.999999,P^2),ncol=P) + diag(rep(0.000001, P)),
    iter = 1e4
){
    z <- matrix(rnorm(P * iter), nrow=P)
    L <- Sigma  %>% chol %>% t
    mu + (L %*% z)
}

system.time(samples_rmultinorm <- rmultinorm())
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  2.0.1     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.3.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
   user  system elapsed 
  0.002   0.000   0.002 
In [6]:
wm <- samples_rmultinorm[1,]
xm <- samples_rmultinorm[2,]
ym <- samples_rmultinorm[3,]
zm <- samples_rmultinorm[4,]

c(
    SS = length(wm),
    ESS_w = effective_sample_size(wm),
    ESS_x = effective_sample_size(wm),
    ESS_y = effective_sample_size(ym),
    ESS_z = effective_sample_size(zm)
)
SS
10000
ESS_w
10166.6044641656
ESS_x
10166.6044641656
ESS_y
10166.7612039991
ESS_z
10166.5220312536

Now, let us consider a figuratively real world example:

NASA is given information on the projected position of space debrit, as well as estimated uncertainty in their position. They use these to see monitor potential collisions between this debrit and satellites. If the risk of a collission is unacceptably high, the satellite will spend some of its limmited fuel to maneuver out of the way.

The projections and uncertaintly are calculated regularly, giving rich datasets for each piece of debrit giving the difference between where the debrit was projected to be, and where it actually ended up being, alongside the original uncertainty in that estimate.
Here is a fake sample data set (releasing real data on their positions requires paper work) “BigPropPoints”:

In [7]:
setwd("/home/chriselrod/Documents/progwork/R")
BigPropPoints <- read.csv("BigPropPoints.csv", header = FALSE) %>% as.matrix
str(BigPropPoints)
 num [1:10, 1:1024] 0.217 0.2024 0.0936 4.9725 0.0101 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:1024] "V1" "V2" "V3" "V4" ...

The matrix is 10 x N, where N is the sample size of observations.
Each column corresponds to one observation, and the rows:

  1. Rows 1-3 give the error in each of the position coordinates.
  2. Row 4 we ignore, but for convenience’s sake, I filled it in with the squared Mahalanobis distance in my fake data.
  3. Rows 5:6 give the upper triangle of the projected covariance matrix.

To calculate the probability of two satellites colliding, NASA uses a routine that assumes multivariate normality of the position of both objects. Therefore, ideally, if the projected distance and covariance matrix were correct, we would have:

$\textbf{x} \sim \mathcal{N}_3\left(\textbf{0},\boldsymbol{\Sigma}\right).$

Therefore, the squared Mahalanobis distances would have a

$m^2 \sim \mathcal{X}^2_3$

distribution. To see why, a brief proof built on the following, which I take for granted

  1. If $z \sim \mathcal{N}(0,1)$, then $z^2 \sim \mathcal{X}^2_1$.
  2. If $w$ and $y$ are independently distributed $w\sim\mathcal{X}^2_{\nu_w}$ and $y\sim\mathcal{X}^2_{\nu_y}$ respectively, then $(w+y)\sim\mathcal{X}^2_{\nu_w+\nu_y}$.
  3. If $\textbf{x}\sim\mathcal{N}\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$, then $\textbf{Ax}\sim\mathcal{N}\left(\textbf{A}\boldsymbol{\mu},\textbf{A}\boldsymbol{\Sigma}\textbf{A}’\right)$.

Now, letting $\textbf{L}$ be the lower triangular Cholesky decomposition of $\boldsymbol{\Sigma}$, such that $\textbf{LL}’ = \boldsymbol{\Sigma}$, then

$\begin{align*}
\textbf{x} &\sim \mathcal{N}_p\left(\textbf{0},\boldsymbol{\Sigma}\right)\\
\textbf{x} &\sim \mathcal{N}_p\left(\textbf{0},\textbf{LL}’\right)\\
\textbf{L}^{-1}\textbf{x} &\sim \mathcal{N}_p\left(\textbf{0},\textbf{L}^{-1}\textbf{LL}’\textbf{L}’^{-1}\right)\\
\textbf{L}^{-1}\textbf{x} &\sim \mathcal{N}_p\left(\textbf{0},\textbf{I}\right)
\end{align*}$

Therefore, if we let $\textbf{z} = \textbf{L}^{-1}\textbf{x}$, then $z_i$ are iid normal, and the distribution of the squared Mahalanobis distance is

$\begin{align*}
m^2 &= \textbf{x}’\boldsymbol{\Sigma}^{-1}\textbf{x}\\
&= \textbf{x}’\left(\textbf{LL}’\right)^{-1}\textbf{x}\\
&= \textbf{x}’\textbf{L}’^{-1}\textbf{L}^{-1}\textbf{x}\\
&= \left(\textbf{L}^{-1}\textbf{x}\right)’\textbf{L}^{-1}\textbf{x}\\
&= \textbf{z}’\textbf{z}\\
&= z_1^2 + z_2^2 + \dots + z_p^2\\
m &\sim \mathcal{X}^2_p.
\end{align*}$

So, let’s see how well that holds up, starting with a simple hypothesis test. Not that I condone null hypothesis significance testing. But they are an easy smell.

In [8]:
m2 <- BigPropPoints[4,]
ks.test(m2, "pchisq", 3)
	One-sample Kolmogorov-Smirnov test

data:  m2
D = 0.35083, p-value < 2.2e-16
alternative hypothesis: two-sided

Uh oh. That does not look good. Lets plot it:

In [9]:
theme_set(theme_bw())

ggplot(data.frame(mahalanobis = m2)) +
    geom_density(aes(mahalanobis), color = "red") +
    stat_function(fun = dchisq, args = list(df = 3), color = "blue", n = 1001) +
    theme_bw() +
    xlim(c(0,30))
Warning message:
“Removed 160 rows containing non-finite values (stat_density).”

The person who thinks that looks like a good fit drank too much wood alcohol.

The routine for determining the probability of collision requires multivariate normality. So we cannot abandon normality easily. And besides, why would we ever want something other than a normal distribution?

The logical route forward is to instead double down on normality: instead of using a multivariate normal distribution, we’ll use many multivariate normal distributions (in a mixture).

To simplify things, for now we will make the unnecessary assumption that the projected location is unbiased.

We will also rename the projected covariance $\textbf{S}$, to emphasize that it is not a true parameter, but another piece of data we are given; a fallible projection. (Following the convention set by Fisher, where parameters are designated by Greek letters, and data by the modern English/Latin alphabet).
Now, let

  1. $\textbf{x}$ be the miss distance
  2. $\textbf{L}$ be the lower triangular Cholesky factor of $\textbf{S}$
  3. $\textbf{y} = \textbf{L}^{-1}\textbf{x}$.

Then, we can model $\textbf{y}$ as coming from a mixture of multivariate normal distributions of mean $\textbf{0}$, and covariance $\boldsymbol{\Sigma}_g$. Additionally, let

  1. There be $G$ groups.
  2. $f\left(\textbf{y}|\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$ be the multivariate normal pdf.
  3. $\rho_g > 0, g = 1,\ldots,G,\sum_{g=1}^G\rho_g = 1$ be the probabilities of group membership.

Then the pdf of $\textbf{y}$ is

\begin{align*}
f\left(\textbf{y}|\textbf{0},\boldsymbol{\rho}_{\textbf{g}},\boldsymbol{\Sigma}_{\textbf{g}}\right)
&=
\sum_{g=1}^G \rho_g f\left(\textbf{y}|\textbf{0}, \boldsymbol{\Sigma}_{g} \right).
\end{align*}

We can give $\boldsymbol{\Sigma}_g$ an Inverse Wishart prior distribution.

Now lets create y, and define a simple JAGS model:

In [10]:
process_bpp_apply <- function(BPP){
    apply(BPP, 2, function(x){
        S <- matrix(
            c(
                x[5],x[6],x[8],
                x[6],x[7],x[9],
                x[8],x[9],x[10]
             ), ncol = 3
        )
        forwardsolve(t(chol(S)), x[1:3])
    })
}

y <- process_bpp_apply(BigPropPoints)

satelliteCode <- "model {
    pi[1:G] ~ ddirich(alpha[1:G])
    
    for (n in 1:N){
        groups[n] ~ dcat(pi[1:G])
        y[1:3,n] ~ dmnorm(zero[1:3], Lambda[1:3, 1:3, groups[n]])
    }

    for (g in 1:G){
        Lambda[1:3,1:3,g] ~ dwish(R[1:3,1:3], nu)

        RevCholWishart[3,3,g] <- sqrt(Lambda[3,3,g])
        CholInvWishart[3,3,g] <- 1 / RevCholWishart[3,3,g]
        RevCholWishart[3,1,g] <- Lambda[3,1,g] * CholInvWishart[3,3,g]
        RevCholWishart[3,2,g] <- Lambda[3,2,g] * CholInvWishart[3,3,g]
        RevCholWishart[2,2,g] <- sqrt(Lambda[2,2,g] - RevCholWishart[3,2,g]^2)
        CholInvWishart[2,2,g] <- 1 / RevCholWishart[2,2,g]
        RevCholWishart[2,1,g] <- (Lambda[2,1,g] - RevCholWishart[3,1,g]*RevCholWishart[3,2,g]) * CholInvWishart[2,2,g]
        RevCholWishart[1,1,g] <- sqrt(Lambda[1,1,g] - RevCholWishart[2,1,g]^2 - RevCholWishart[3,1,g]^2)
        CholInvWishart[1,1,g] <- 1 / RevCholWishart[1,1,g]
        RevCholWishart[1,2,g] <- 0
        RevCholWishart[1,3,g] <- 0
        RevCholWishart[2,3,g] <- 0
        CholInvWishart[1,2,g] <- 0
        CholInvWishart[1,3,g] <- 0
        CholInvWishart[2,3,g] <- 0
        CholInvWishart[2,1,g] <- - CholInvWishart[2,2,g] * RevCholWishart[2,1,g] * CholInvWishart[1,1,g]
        CholInvWishart[3,1,g] <-  RevCholWishart[3,3,g] * ( RevCholWishart[3,1,g]*RevCholWishart[1,1,g] + RevCholWishart[3,2,g]*RevCholWishart[2,1,g] )
        CholInvWishart[3,2,g] <- - RevCholWishart[3,3,g] * RevCholWishart[3,2,g] * RevCholWishart[2,2,g]
    }
}"

Please pardon the messy calculations of RevCholWishart and CholInvWishart.
The analysis I want to do requires them, so we may as well go ahead and calculate them during the sampler.

They are the Cholesky factor of the Inverse-Wishart distribution (CholInvWishart), and that matrice’s inverse (RevCholWishart; “reverse Cholesky”, because it is a lower triangular matrix $\textbf{R}$ such that $\textbf{R}’\textbf{R} = \boldsymbol{\Lambda}$, instead of the usual Cholesky factor $\textbf{L}$ that is a lower triangualr matrix such that $\textbf{LL}’ = \boldsymbol{\Lambda}$.). The above code calculates them withought needing to calculate the InverseWishart.

One final point on mixture distributions: “label switching” is a problem you always have to consider with finite mixtures. Label switching can make it hard to get estimates of what individual componets are like, and can reak havoc on convergence diagnostics.

Imagine you have univariate data, and two groups. One has mean $-2$, and the other has mean $2$. If you fit a mixture model with groups “A” and “B”, and draw 10,000 samples, it may be that for the first 5000 samples group “A” is the group with mean $-2$, and group “B” is the one with mean $2$. But then that half way through, they switch labels. Then your posterior mean for both would be about $0$, and would hurt autocorrelation: the sample’s overall variance would be inflated by a sort of “Bernoulli factor” ( $p(1-p)$ * (upper mean – lower mean) ). If this is large relative to the variability of the groups, it would result in large autocorrelations, even if those autocorrelations are small during all subsets of the chain without switching.

The problem gets more prevalent when you have multiple independent chains, that started with different initial values (as the chains should).

Thankfully, in our case, the purpose of our analysis is not to draw inference on the distributions of the mixture components, but on the entire mixture. Our assessment of convergence may still be frustrated, but our interests are label-agnostic.

In [11]:
G <- 6L
N <- ncol(y)
R <- diag(rep(1,3))
nu <- 5
zero <- rep(0, 3)
alpha <- sapply(1:G, function(g) 1.5^(G-g))
PES <- sum(alpha)
list(alpha = alpha, PES = PES, p = signif(alpha / PES, digit=5))

data <- list(y = y,
             G = G,
             N = N,
             R = R,
             nu = nu,
             zero = zero,
             alpha = alpha)
                
monitor <- c("RevCholWishart", "CholInvWishart", "pi")
library(runjags)
system.time(
    probColSamples1chain <- run.jags(
        satelliteCode,
        data = data, monitor = monitor, n.chains = 1,
        burnin = 4000, sample = 10000
    )
)
summary(probColSamples1chain)
$alpha
  1. 7.59375
  2. 5.0625
  3. 3.375
  4. 2.25
  5. 1.5
  6. 1
$PES
20.78125
$p
  1. 0.36541
  2. 0.24361
  3. 0.16241
  4. 0.10827
  5. 0.07218
  6. 0.04812
Attaching package: ‘runjags’

The following object is masked from ‘package:tidyr’:

    extract

Loading required namespace: rjags
Compiling rjags model...
Calling the simulation using the rjags method...
Note: the model did not require adaptation
Burning in the model for 4000 iterations...
Running the model for 10000 iterations...
Simulation complete
Note: Summary statistics were not produced as there are >50 monitored
variables
[To override this behaviour see ?add.summary and ?runjags.options]
FALSEFinished running the simulation
   user  system elapsed 
130.668   0.143 131.039 
Calculating summary statistics...
Note: The monitored variables 'RevCholWishart[1,2,1]',
'RevCholWishart[1,3,1]', 'RevCholWishart[2,3,1]',
'RevCholWishart[1,2,2]', 'RevCholWishart[1,3,2]',
'RevCholWishart[2,3,2]', 'RevCholWishart[1,2,3]',
'RevCholWishart[1,3,3]', 'RevCholWishart[2,3,3]',
'RevCholWishart[1,2,4]', 'RevCholWishart[1,3,4]',
'RevCholWishart[2,3,4]', 'RevCholWishart[1,2,5]',
'RevCholWishart[1,3,5]', 'RevCholWishart[2,3,5]',
'RevCholWishart[1,2,6]', 'RevCholWishart[1,3,6]',
'RevCholWishart[2,3,6]', 'CholInvWishart[1,2,1]',
'CholInvWishart[1,3,1]', 'CholInvWishart[2,3,1]',
'CholInvWishart[1,2,2]', 'CholInvWishart[1,3,2]',
'CholInvWishart[2,3,2]', 'CholInvWishart[1,2,3]',
'CholInvWishart[1,3,3]', 'CholInvWishart[2,3,3]',
'CholInvWishart[1,2,4]', 'CholInvWishart[1,3,4]',
'CholInvWishart[2,3,4]', 'CholInvWishart[1,2,5]',
'CholInvWishart[1,3,5]', 'CholInvWishart[2,3,5]',
'CholInvWishart[1,2,6]', 'CholInvWishart[1,3,6]' and
'CholInvWishart[2,3,6]' appear to be non-stochastic; they will not be
included in the convergence diagnostic
Warning message:
“Convergence cannot be assessed with only 1 chain”
Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
RevCholWishart[1,1,1] 0.59333329 0.65283881 0.718538143 0.65353926 0.032259341 NA 0.0008588658 2.7 1411 0.10315022 NA
RevCholWishart[2,1,1] 0.04144505 0.16107177 0.291342864 0.16458123 0.063611549 NA 0.0040628013 6.4 245 0.41123794 NA
RevCholWishart[3,1,1] -0.18891759 -0.05807805 0.080696552 -0.05527455 0.069502662 NA 0.0038167289 5.5 332 0.42792419 NA
RevCholWishart[1,2,1] 0.00000000 0.00000000 0.000000000 0.00000000 0.000000000 0 NA NA NA NA NA
RevCholWishart[2,2,1] 1.14560473 1.28407406 1.437247096 1.28552559 0.073483082 NA 0.0029320051 4.0 628 0.26103906 NA
RevCholWishart[3,2,1] -1.76604670 -1.36925445 -1.011340736 -1.38311285 0.193658600 NA 0.0124150859 6.4 243 0.50698346 NA
RevCholWishart[1,3,1] 0.00000000 0.00000000 0.000000000 0.00000000 0.000000000 0 NA NA NA NA NA
RevCholWishart[2,3,1] 0.00000000 0.00000000 0.000000000 0.00000000 0.000000000 0 NA NA NA NA NA
RevCholWishart[3,3,1] 0.92488065 1.06981155 1.228154271 1.07446194 0.077944762 NA 0.0048068846 6.2 263 0.41564789 NA
RevCholWishart[1,1,2] 0.36053068 0.50518930 0.697588578 0.51863256 0.103706958 NA 0.0111490496 10.8 87 0.70865704 NA
RevCholWishart[2,1,2] -0.52429294 -0.26329209 0.007176825 -0.26683897 0.134180553 NA 0.0073472922 5.5 334 0.43391011 NA
RevCholWishart[3,1,2] -1.10420523 -0.44480547 0.702719805 -0.40649131 0.416067493 NA 0.0737136800 17.7 32 0.85618004 NA
RevCholWishart[1,2,2] 0.00000000 0.00000000 0.000000000 0.00000000 0.000000000 0 NA NA NA NA NA
RevCholWishart[2,2,2] 0.76896333 1.58454452 2.459810971 1.61069956 0.465456019 NA 0.0432754747 9.3 116 0.67636155 NA
RevCholWishart[3,2,2] -1.18123317 -0.18597125 0.869220434 -0.16606926 0.492418060 NA 0.0288980051 5.9 290 0.49187319 NA
RevCholWishart[1,3,2] 0.00000000 0.00000000 0.000000000 0.00000000 0.000000000 0 NA NA NA NA NA
RevCholWishart[2,3,2] 0.00000000 0.00000000 0.000000000 0.00000000 0.000000000 0 NA NA NA NA NA
RevCholWishart[3,3,2] 0.65428948 1.03825551 1.911240404 1.14185285 0.364545587 NA 0.0326524777 9.0 125 0.69308193 NA
RevCholWishart[1,1,3] 0.13108006 0.14475564 0.158526129 0.14476716 0.007036883 NA 0.0001378708 2.0 2605 0.04385236 NA
RevCholWishart[2,1,3] 0.26800938 0.30538582 0.343122977 0.30587925 0.019375501 NA 0.0005624035 2.9 1187 0.10157488 NA
RevCholWishart[3,1,3] 0.17725552 0.23199291 0.288185903 0.23133339 0.028246174 NA 0.0008444113 3.0 1119 0.12846728 NA
RevCholWishart[1,2,3] 0.00000000 0.00000000 0.000000000 0.00000000 0.000000000 0 NA NA NA NA NA
RevCholWishart[2,2,3] 0.47303580 0.52596012 0.581432054 0.52678538 0.027443051 NA 0.0007257037 2.6 1430 0.08492463 NA
RevCholWishart[3,2,3] -0.58732528 -0.49074932 -0.401516968 -0.49214962 0.047353041 NA 0.0014375087 3.0 1085 0.11713923 NA
RevCholWishart[1,3,3] 0.00000000 0.00000000 0.000000000 0.00000000 0.000000000 0 NA NA NA NA NA
RevCholWishart[2,3,3] 0.00000000 0.00000000 0.000000000 0.00000000 0.000000000 0 NA NA NA NA NA
RevCholWishart[3,3,3] 0.59600416 0.65746233 0.724652896 0.65779859 0.032859266 NA 0.0005190433 1.6 4008 0.02554498 NA
RevCholWishart[1,1,4] 0.29721601 0.62448611 1.543925283 0.74230524 0.371458113 NA 0.0451191332 12.1 68 0.65600201 NA
RevCholWishart[2,1,4] -1.31410249 -0.23342812 1.642741822 -0.16996990 0.633355641 NA 0.0537282697 8.5 139 0.62232754 NA
RevCholWishart[3,1,4] -1.91523153 0.46657601 1.549334672 0.19190476 0.920048433 NA 0.1254708091 13.6 54 0.76980089 NA
CholInvWishart[1,2,4] 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000000 0 NA NA NA NA NA
CholInvWishart[2,2,4] 2.935010e-01 0.82818681 1.27670322 0.80978430 0.29212854 NA 0.0376925508 12.9 60 0.68799406 NA
CholInvWishart[3,2,4] -6.018247e+00 0.13694248 5.68633416 -0.07763035 2.64804996 NA 0.1266032634 4.8 437 0.25688767 NA
CholInvWishart[1,3,4] 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000000 0 NA NA NA NA NA
CholInvWishart[2,3,4] 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000000 0 NA NA NA NA NA
CholInvWishart[3,3,4] 2.823146e-01 0.67271606 1.22862946 0.71049802 0.26944210 NA 0.0230644408 8.6 136 0.67810602 NA
CholInvWishart[1,1,5] 3.085498e-01 1.27710777 2.40378043 1.32464981 0.60703998 NA 0.0502353817 8.3 146 0.57575896 NA
CholInvWishart[2,1,5] -1.281844e+00 0.12924489 1.08626553 0.06289763 0.63003042 NA 0.0475349907 7.5 176 0.58696849 NA
CholInvWishart[3,1,5] -7.068269e+00 0.30798377 6.79333215 0.08148730 3.20790791 NA 0.1072933926 3.3 894 0.07120210 NA
CholInvWishart[1,2,5] 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000000 0 NA NA NA NA NA
CholInvWishart[2,2,5] 2.865773e-01 0.68944658 1.27426287 0.73391373 0.29258853 NA 0.0279010056 9.5 110 0.51162999 NA
CholInvWishart[3,2,5] -7.926700e+00 0.11661672 7.62054053 -0.05858076 3.69472106 NA 0.1185203007 3.2 972 0.12613903 NA
CholInvWishart[1,3,5] 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000000 0 NA NA NA NA NA
CholInvWishart[2,3,5] 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000000 0 NA NA NA NA NA
CholInvWishart[3,3,5] 2.640530e-01 0.54171456 1.15169201 0.61130378 0.25714174 NA 0.0223742763 8.7 132 0.57448869 NA
CholInvWishart[1,1,6] 2.997205e-01 0.96289109 2.67799687 1.19571708 0.73875264 NA 0.0804953472 10.9 84 0.65040586 NA
CholInvWishart[2,1,6] -1.304394e+00 0.09295717 1.04166574 0.04862127 0.59410339 NA 0.0305034988 5.1 379 0.38453740 NA
CholInvWishart[3,1,6] -8.991965e+00 -0.03514051 7.60185021 -0.08439977 3.83167950 NA 0.0809910293 2.1 2238 0.07010276 NA
CholInvWishart[1,2,6] 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000000 0 NA NA NA NA NA
CholInvWishart[2,2,6] 2.564553e-01 0.57664853 1.20984218 0.64901253 0.27908346 NA 0.0136184235 4.9 420 0.26324648 NA
CholInvWishart[3,2,6] -8.657267e+00 -0.03193687 8.71191949 -0.18642239 4.12681004 NA 0.1076377902 2.6 1470 0.09131084 NA
CholInvWishart[1,3,6] 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000000 0 NA NA NA NA NA
CholInvWishart[2,3,6] 0.000000e+00 0.00000000 0.00000000 0.00000000 0.00000000 0 NA NA NA NA NA
CholInvWishart[3,3,6] 2.433381e-01 0.50284103 1.03639952 0.56351556 0.23258441 NA 0.0127730116 5.5 332 0.29682426 NA
pi[1] 3.580098e-01 0.48163329 0.58418771 0.47541469 0.05864633 NA 0.0048613827 8.3 146 0.70241723 NA
pi[2] 4.019151e-02 0.12985234 0.24663637 0.13702114 0.05573446 NA 0.0055819657 10.0 100 0.78042955 NA
pi[3] 2.224569e-01 0.26015467 0.30065110 0.26022123 0.01985367 NA 0.0005923386 3.0 1123 0.11838110 NA
pi[4] 3.510240e-03 0.05816868 0.14154935 0.06573845 0.04009020 NA 0.0046808167 11.7 73 0.80956206 NA
pi[5] 1.250678e-04 0.02752870 0.11188551 0.03928700 0.03537422 NA 0.0058237478 16.5 37 0.85540395 NA
pi[6] 3.891006e-07 0.01378290 0.07592672 0.02231749 0.02550848 NA 0.0030467317 11.9 70 0.82463029 NA
In [12]:
probcol_df <- as.data.frame(probColSamples1chain$mcmc[[1]])
c(
    ESS_rcw111 = effective_sample_size(probcol_df[["RevCholWishart[1,1,1]"]]),
    ESS_rcw211 = effective_sample_size(probcol_df[["RevCholWishart[2,1,1]"]]),
    ESS_rcw311 = effective_sample_size(probcol_df[["RevCholWishart[3,1,1]"]]),
    ESS_rcw221 = effective_sample_size(probcol_df[["RevCholWishart[2,2,1]"]]),
    ESS_rcw321 = effective_sample_size(probcol_df[["RevCholWishart[3,2,1]"]]),
    ESS_rcw331 = effective_sample_size(probcol_df[["RevCholWishart[3,3,1]"]])
)
ESS_rcw111
828.298426688448
ESS_rcw211
164.37482102413
ESS_rcw311
111.605257391113
ESS_rcw221
203.171379465751
ESS_rcw321
172.883353812559
ESS_rcw331
198.868816090994

Took 138 seconds, cool. We’ll take a look at some of these samples later to assess how well it fits.

Regardless, we have to analyze tens of thousands of such data sets. Analyzing the data sets in parallel, it would still take a very long time.
Can we write a more efficient sampler? Of course!

We can split up our parameters into three blocks:

  1. The (Inverse-)Wisharts $(\boldsymbol{\Sigma})\ \boldsymbol{\Lambda}$. Conditioning on the group membership of each $\textbf{y}_n$, the (Inverse-)Wisharts are conjugate with the normal distribution on $\textbf{y}_n$.
  2. The proportion beloning to each group, $\boldsymbol{\pi}$. Conditioning on the number of observations in each group, the Dirichlet-categorical distribution is conjugate.
  3. Group memberships. Conditioning on the proportions and inverse wisharts of each group, we can calculate the individual likelihood of each $\textbf{y}$ belonging in each group. Categorical distributions are straightforward to sample from.

Further, we can also simplify the model by marginalizing out the (Inverse-)Wisharts, and replacing the multivariate normal with a multivariate t distribution.

So, let’s write our own and see how it does.