Now, let's talk about a practical reason why we might want to write our own Gibbs sampler.
The number one reason is that you can get much better sampling if you come up with clever ways to group your parameters. In the last post, we saw that JAGS trounced our own Gibbs sampler in effective sample size (and effective sample size per second) by recognizing that it could block $y\sim\text{Binomial}(n,\theta)$ and $\theta\sim\text{Beta}(\alpha,\beta)$ together, instead of putting them in separate blocks like we did.
Let's look at another example with an easy closed form solution:
\begin{align*}
\begin{bmatrix}w\\x\\y\\z\end{bmatrix}
&\sim\mathcal{N}\left(
\textbf{0},
\begin{bmatrix}
1 & 0.999999 & 0.999999 & 0.999999\\
0.999999 & 1 & 0.999999 & 0.999999\\
0.999999 & 0.999999 & 1 & 0.999999\\
0.999999 & 0.999999 & 0.999999 & 1
\end{bmatrix}
\right)
\end{align*}
The bad way to sample from this distribution would be to divide $w$, $x$, $y$, and $z$ into separate blocks, sampling them conditioned on the others. Taking advantage of the multivariate normality, we have
\begin{align*}
w | x, y, z &\sim \mathcal{N}\left(\tiny{
\begin{bmatrix}0.999999 & 0.999999 & 0.999999\end{bmatrix}
\begin{bmatrix}1 & 0.999999 & 0.999999\\0.999999 & 1 & 0.999999\\0.999999 & 0.999999 & 1\end{bmatrix}^{-1}\begin{bmatrix}x\\y\\z\end{bmatrix},
1 -
\begin{bmatrix}0.999999 & 0.999999 & 0.999999\end{bmatrix}
\begin{bmatrix}1 & 0.999999 & 0.999999\\0.999999 & 1 & 0.999999\\0.999999 & 0.999999 & 1\end{bmatrix}^{-1}
\begin{bmatrix}0.999999 \\ 0.999999 \\ 0.999999\end{bmatrix}}
\right)\\
w | x, y, z &\sim \mathcal{N}\left(
0.333x + 0.333y + 0.333z,
1.333\times 10^{-6}
\right)
\end{align*}
Yikes!!! The marginal variance of $w$ is 1, but the conditional variance is about $1.333\times 10^{-6}$.
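If you want to double-check that Schur-complement arithmetic numerically, a short sketch (the variable names here are mine) does it:
Sigma <- matrix(0.999999, 4, 4) + diag(rep(0.000001, 4))
cond_weights <- Sigma[1, -1] %*% solve(Sigma[-1, -1])    # conditional mean weights, ~0.333 each
cond_var <- Sigma[1, 1] - cond_weights %*% Sigma[-1, 1]  # conditional variance, ~1.333e-6
list(weights = drop(cond_weights), conditional_variance = drop(cond_var))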
The conditionals of $x$, $y$, and $z$ look similar. For good measure, let's write the sampler, for the pleasure of basking in the resulting autocorrelation.
set.seed(123)
autocorrelation_gibbs <- function(
  mu = rep(0, 4), P = length(mu),
  Sigma = matrix(rep(0.999999, P^2), ncol = P) + diag(rep(0.000001, P)),
  iter = 1e4, warmup = 1e3, inits = rnorm(P)
){
  # Conditional standard deviations and mean weights for each full conditional,
  # via the usual multivariate normal conditioning (Schur complement) formulas.
  conditional_sd <- sapply(1:P, function(p) sqrt(Sigma[p, p] - Sigma[p, -p] %*% solve(Sigma[-p, -p]) %*% Sigma[-p, p]))
  mean_factors <- lapply(1:P, function(p) Sigma[p, -p] %*% solve(Sigma[-p, -p]))
  samples <- matrix(double(P * iter), nrow = P)
  samples[, 1] <- inits
  # Warmup repeatedly overwrites the first column; seq_len() makes this a no-op when warmup = 0.
  for (i in seq_len(warmup)){
    for (p in 1:P){
      othervals <- samples[-p, 1]
      samples[p, 1] <- rnorm(1,
        mean = mu[p] + mean_factors[[p]] %*% (othervals - mu[-p]),
        sd = conditional_sd[p]
      )
    }
  }
  # Sampling: condition on this iteration's values for components already updated
  # (indices < p) and on the previous iteration's values for the rest.
  for (i in 2:iter){
    for (p in 1:P){
      othervals <- ifelse(1:(P-1) < p, samples[1:(P-1), i], samples[2:P, i-1])
      samples[p, i] <- rnorm(1,
        mean = mu[p] + mean_factors[[p]] %*% (othervals - mu[-p]),
        sd = conditional_sd[p]
      )
    }
  }
  samples
}
system.time(samples_autocor <- autocorrelation_gibbs())
effective_sample_size <- function(x){
  # Estimate ESS as N / (1 + 2 * sum of autocorrelations), summing the sample
  # autocorrelations up through the first lag at which they go negative.
  auto_corr <- acf(x, plot = FALSE, lag.max = length(x) - 1)$acf
  auto_corr_sum <- sum(auto_corr[2:(which.max(auto_corr < 0))])
  length(x) / (2 * auto_corr_sum + 1)
}
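(This is a fairly crude estimator. If you want an independent check, coda's effectiveSize(), which estimates the spectral density at zero from a fitted AR model, should give numbers in the same dismal ballpark.)
library(coda)
effectiveSize(mcmc(t(samples_autocor)))  # one ESS estimate per column / parameter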
w <- samples_autocor[1,]
x <- samples_autocor[2,]
y <- samples_autocor[3,]
z <- samples_autocor[4,]
c(
SS = length(w),
ESS_w = effective_sample_size(w),
ESS_x = effective_sample_size(x),
ESS_y = effective_sample_size(y),
ESS_z = effective_sample_size(z)
)
acf(w)
That is bad. With a conditional standard deviation of about $0.001$, compared to the marginal standard deviation of 1, it takes a very long time for the samples to walk the posterior. To demonstrate this, let's start with bad initial values:
samples_autocor <- autocorrelation_gibbs(inits = rep(10,4), warmup = 0)
apply(samples_autocor, 1, mean)
They barely managed to drift away from the initial values at all!
How can we do better?
We could instead sample directly from the multivariate distribution.
library(tidyverse)
rmultinorm <- function(
  mu = rep(0, 4), P = length(mu),
  Sigma = matrix(rep(0.999999, P^2), ncol = P) + diag(rep(0.000001, P)),
  iter = 1e4
){
  # Independent standard normals, one column per draw.
  z <- matrix(rnorm(P * iter), nrow = P)
  # Lower Cholesky factor of Sigma, so that L %*% z has covariance Sigma.
  L <- Sigma %>% chol %>% t
  # R recycles mu down each column, shifting every draw by the mean vector.
  mu + (L %*% z)
}
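(If you would rather lean on an existing implementation, MASS::mvrnorm should give equivalent draws; it factors Sigma via an eigendecomposition rather than a Cholesky.)
library(MASS)
samples_mass <- t(mvrnorm(1e4, mu = rep(0, 4), Sigma = matrix(0.999999, 4, 4) + diag(rep(0.000001, 4))))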
system.time(samples_rmultinorm <- rmultinorm())
wm <- samples_rmultinorm[1,]
xm <- samples_rmultinorm[2,]
ym <- samples_rmultinorm[3,]
zm <- samples_rmultinorm[4,]
c(
SS = length(wm),
ESS_w = effective_sample_size(wm),
ESS_x = effective_sample_size(xm),
ESS_y = effective_sample_size(ym),
ESS_z = effective_sample_size(zm)
)
Now, let us consider a figuratively real-world example:
NASA is given information on the projected positions of space debris, as well as the estimated uncertainty in those positions. They use these to monitor potential collisions between the debris and satellites. If the risk of a collision is unacceptably high, the satellite will spend some of its limited fuel to maneuver out of the way.
The projections and uncertainties are calculated regularly, giving a rich data set for each piece of debris: the difference between where the debris was projected to be and where it actually ended up, alongside the original uncertainty in that projection.
Here is a fake sample data set (releasing real data on their positions requires paperwork), “BigPropPoints”:
setwd("/home/chriselrod/Documents/progwork/R")
BigPropPoints <- read.csv("BigPropPoints.csv", header = FALSE) %>% as.matrix
str(BigPropPoints)
The matrix is 10 x N, where N is the number of observations.
Each column corresponds to one observation, and the rows are laid out as follows:
- Rows 1-3 give the error in each of the position coordinates.
- Row 4 we ignore, but for convenience's sake, I filled it in with the squared Mahalanobis distance in my fake data.
- Rows 5-10 give the upper triangle of the projected covariance matrix.
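As a quick sanity check on that layout, we can rebuild the covariance matrix for the first observation and confirm that row 4 matches the squared Mahalanobis distance (this assumes the same column-major packing of the upper triangle that process_bpp_apply uses further down):
obs1 <- BigPropPoints[, 1]
S1 <- matrix(c(obs1[5], obs1[6], obs1[8],
               obs1[6], obs1[7], obs1[9],
               obs1[8], obs1[9], obs1[10]), ncol = 3)
c(stored = obs1[4], recomputed = mahalanobis(obs1[1:3], center = rep(0, 3), cov = S1))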
To calculate the probability of two satellites colliding, NASA uses a routine that assumes multivariate normality of the position of both objects. Therefore, ideally, if the projected distance and covariance matrix were correct, we would have:
$\textbf{x} \sim \mathcal{N}_3\left(\textbf{0},\boldsymbol{\Sigma}\right).$
Therefore, the squared Mahalanobis distances would have a
$m^2 \sim \chi^2_3$
distribution. To see why, here is a brief proof built on the following facts, which I take for granted:
- If $z \sim \mathcal{N}(0,1)$, then $z^2 \sim \chi^2_1$.
- If $w$ and $y$ are independently distributed $w\sim\chi^2_{\nu_w}$ and $y\sim\chi^2_{\nu_y}$ respectively, then $(w+y)\sim\chi^2_{\nu_w+\nu_y}$.
- If $\textbf{x}\sim\mathcal{N}\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$, then $\textbf{Ax}\sim\mathcal{N}\left(\textbf{A}\boldsymbol{\mu},\textbf{A}\boldsymbol{\Sigma}\textbf{A}'\right)$.
Now, letting $\textbf{L}$ be the lower triangular Cholesky decomposition of $\boldsymbol{\Sigma}$, such that $\textbf{LL}' = \boldsymbol{\Sigma}$, then
\begin{align*}
\textbf{x} &\sim \mathcal{N}_p\left(\textbf{0},\boldsymbol{\Sigma}\right)\\
\textbf{x} &\sim \mathcal{N}_p\left(\textbf{0},\textbf{LL}'\right)\\
\textbf{L}^{-1}\textbf{x} &\sim \mathcal{N}_p\left(\textbf{0},\textbf{L}^{-1}\textbf{LL}'\textbf{L}'^{-1}\right)\\
\textbf{L}^{-1}\textbf{x} &\sim \mathcal{N}_p\left(\textbf{0},\textbf{I}\right)
\end{align*}
Therefore, if we let $\textbf{z} = \textbf{L}^{-1}\textbf{x}$, then the $z_i$ are iid standard normal, and the distribution of the squared Mahalanobis distance is
\begin{align*}
m^2 &= \textbf{x}'\boldsymbol{\Sigma}^{-1}\textbf{x}\\
&= \textbf{x}'\left(\textbf{LL}'\right)^{-1}\textbf{x}\\
&= \textbf{x}'\textbf{L}'^{-1}\textbf{L}^{-1}\textbf{x}\\
&= \left(\textbf{L}^{-1}\textbf{x}\right)'\textbf{L}^{-1}\textbf{x}\\
&= \textbf{z}'\textbf{z}\\
&= z_1^2 + z_2^2 + \dots + z_p^2\\
m^2 &\sim \chi^2_p.
\end{align*}
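Before trusting that on the real data, here is a quick simulation check of the claim, reusing rmultinorm from above (the covariance matrix below is just an arbitrary positive definite example):
Sigma_sim <- matrix(c(2, 0.5, 0.1,
                      0.5, 1, 0.3,
                      0.1, 0.3, 1.5), ncol = 3)
x_sim <- rmultinorm(mu = rep(0, 3), Sigma = Sigma_sim, iter = 1e5)
m2_sim <- mahalanobis(t(x_sim), center = rep(0, 3), cov = Sigma_sim)
ks.test(m2_sim, "pchisq", 3)  # should fail to reject: m^2 really is chi-squared with 3 df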
So, let's see how well that holds up, starting with a simple hypothesis test. Not that I condone null hypothesis significance testing, but it makes for an easy smell test.
m2 <- BigPropPoints[4,]
ks.test(m2, "pchisq", 3)
Uh oh. That does not look good. Let's plot it:
theme_set(theme_bw())
ggplot(data.frame(mahalanobis = m2)) +
geom_density(aes(mahalanobis), color = "red") +
stat_function(fun = dchisq, args = list(df = 3), color = "blue", n = 1001) +
theme_bw() +
xlim(c(0,30))
The person who thinks that looks like a good fit drank too much wood alcohol.
The routine for determining the probability of collision requires multivariate normality. So we cannot abandon normality easily. And besides, why would we ever want something other than a normal distribution?
The logical route forward is to double down on normality: instead of using a single multivariate normal distribution, we'll use many multivariate normal distributions (in a mixture).
To simplify things, for now we will make the unnecessary assumption that the projected location is unbiased.
We will also rename the projected covariance $\textbf{S}$, to emphasize that it is not a true parameter, but another piece of data we are given: a fallible projection. (This follows the convention set by Fisher, where parameters are designated by Greek letters and data by Latin letters.)
Now, let
- $\textbf{x}$ be the miss distance
- $\textbf{L}$ be the lower triangular Cholesky factor of $\textbf{S}$
- $\textbf{y} = \textbf{L}^{-1}\textbf{x}$.
Then, we can model $\textbf{y}$ as coming from a mixture of multivariate normal distributions of mean $\textbf{0}$, and covariance $\boldsymbol{\Sigma}_g$. Additionally, let
- There be $G$ groups.
- $f\left(\textbf{y}|\boldsymbol{\mu},\boldsymbol{\Sigma}\right)$ be the multivariate normal pdf.
- $\rho_g > 0$, $g = 1,\ldots,G$, with $\sum_{g=1}^G\rho_g = 1$, be the probabilities of group membership.
Then the pdf of $\textbf{y}$ is
\begin{align*}
f\left(\textbf{y}|\textbf{0},\boldsymbol{\rho}_{\textbf{g}},\boldsymbol{\Sigma}_{\textbf{g}}\right)
&=
\sum_{g=1}^G \rho_g f\left(\textbf{y}|\textbf{0}, \boldsymbol{\Sigma}_{g} \right).
\end{align*}
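For concreteness, here is what evaluating that density looks like in R (a sketch using mvtnorm::dmvnorm; rho and Sigma_list are hypothetical stand-ins for the mixture weights and the list of group covariance matrices):
library(mvtnorm)
mixture_density <- function(y, rho, Sigma_list){
  # Weighted sum of zero-mean multivariate normal densities, one term per group.
  sum(sapply(seq_along(rho), function(g) rho[g] * dmvnorm(y, sigma = Sigma_list[[g]])))
}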
We can give $\boldsymbol{\Sigma}_g$ an Inverse Wishart prior distribution.
Now let's create $\textbf{y}$ and define a simple JAGS model:
process_bpp_apply <- function(BPP){
  apply(BPP, 2, function(x){
    # Rebuild the 3x3 projected covariance matrix from rows 5-10 (upper triangle,
    # packed column-major), then whiten the miss distance with its Cholesky factor.
    S <- matrix(
      c(
        x[5], x[6], x[8],
        x[6], x[7], x[9],
        x[8], x[9], x[10]
      ), ncol = 3
    )
    forwardsolve(t(chol(S)), x[1:3])  # y = L^{-1} x
  })
}
y <- process_bpp_apply(BigPropPoints)
satelliteCode <- "model {
  pi[1:G] ~ ddirich(alpha[1:G])
  for (n in 1:N){
    groups[n] ~ dcat(pi[1:G])
    y[1:3,n] ~ dmnorm(zero[1:3], Lambda[1:3, 1:3, groups[n]])
  }
  for (g in 1:G){
    Lambda[1:3,1:3,g] ~ dwish(R[1:3,1:3], nu)
    # RevCholWishart is the lower triangular R with R'R = Lambda, filled in starting
    # from the last row; CholInvWishart is its inverse, the lower Cholesky factor of Lambda^{-1}.
    RevCholWishart[3,3,g] <- sqrt(Lambda[3,3,g])
    CholInvWishart[3,3,g] <- 1 / RevCholWishart[3,3,g]
    RevCholWishart[3,1,g] <- Lambda[3,1,g] * CholInvWishart[3,3,g]
    RevCholWishart[3,2,g] <- Lambda[3,2,g] * CholInvWishart[3,3,g]
    RevCholWishart[2,2,g] <- sqrt(Lambda[2,2,g] - RevCholWishart[3,2,g]^2)
    CholInvWishart[2,2,g] <- 1 / RevCholWishart[2,2,g]
    RevCholWishart[2,1,g] <- (Lambda[2,1,g] - RevCholWishart[3,1,g]*RevCholWishart[3,2,g]) * CholInvWishart[2,2,g]
    RevCholWishart[1,1,g] <- sqrt(Lambda[1,1,g] - RevCholWishart[2,1,g]^2 - RevCholWishart[3,1,g]^2)
    CholInvWishart[1,1,g] <- 1 / RevCholWishart[1,1,g]
    RevCholWishart[1,2,g] <- 0
    RevCholWishart[1,3,g] <- 0
    RevCholWishart[2,3,g] <- 0
    CholInvWishart[1,2,g] <- 0
    CholInvWishart[1,3,g] <- 0
    CholInvWishart[2,3,g] <- 0
    # Off-diagonal elements of the lower triangular inverse.
    CholInvWishart[2,1,g] <- - CholInvWishart[2,2,g] * RevCholWishart[2,1,g] * CholInvWishart[1,1,g]
    CholInvWishart[3,1,g] <- - CholInvWishart[3,3,g] * ( RevCholWishart[3,1,g]*CholInvWishart[1,1,g] + RevCholWishart[3,2,g]*CholInvWishart[2,1,g] )
    CholInvWishart[3,2,g] <- - CholInvWishart[3,3,g] * RevCholWishart[3,2,g] * CholInvWishart[2,2,g]
  }
}"
Please pardon the messy calculations of RevCholWishart and CholInvWishart.
The analysis I want to do requires them, so we may as well calculate them during sampling.
They are the Cholesky factor of the Inverse-Wishart draw (CholInvWishart), and that matrix's inverse (RevCholWishart; "reverse Cholesky", because it is a lower triangular matrix $\textbf{R}$ such that $\textbf{R}'\textbf{R} = \boldsymbol{\Lambda}$, instead of the usual Cholesky factor $\textbf{L}$, a lower triangular matrix such that $\textbf{LL}' = \boldsymbol{\Lambda}$). The code above calculates them without ever needing to form the Inverse-Wishart matrix itself.
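If the in-model arithmetic looks suspicious, the same two matrices can be computed in plain R and checked against their definitions (a sketch using a random test precision matrix):
Lambda_test <- rWishart(1, df = 5, Sigma = diag(3))[, , 1]  # a precision matrix to test with
CholInvWishart_test <- t(chol(solve(Lambda_test)))          # lower Cholesky factor of Lambda^{-1}
RevCholWishart_test <- solve(CholInvWishart_test)           # its inverse: lower triangular, R'R = Lambda
all.equal(t(RevCholWishart_test) %*% RevCholWishart_test, Lambda_test)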
One final point on mixture distributions: “label switching” is a problem you always have to consider with finite mixtures. Label switching can make it hard to get estimates of what the individual components are like, and can wreak havoc on convergence diagnostics.
Imagine you have univariate data and two groups, one with mean $-2$ and the other with mean $2$. If you fit a mixture model with groups “A” and “B” and draw 10,000 samples, it may be that for the first 5,000 samples group “A” is the group with mean $-2$ and group “B” is the one with mean $2$, but that halfway through they switch labels. Then your posterior mean for both would be about $0$, and the switching would hurt autocorrelation: each chain's overall variance gets inflated by a sort of “Bernoulli factor”, $p(1-p)(\text{upper mean} - \text{lower mean})^2$. If this is large relative to the variability within the groups, it results in large autocorrelations, even if the autocorrelations are small within every stretch of the chain that contains no switch.
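To put a number on that “Bernoulli factor”: by the law of total variance, a chain that spends a fraction $p$ of its iterations in one regime and $1-p$ in the other has overall variance equal to the average within-regime variance plus the between-regime term. A minimal sketch with the two groups above (the within-regime variance of 1 is just an illustrative value):
p <- 0.5; mu_low <- -2; mu_high <- 2
within_var <- 1                                    # illustrative within-regime variance
between_var <- p * (1 - p) * (mu_high - mu_low)^2  # the "Bernoulli factor": 0.25 * 16 = 4
within_var + between_var                           # overall chain variance: 5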
The problem gets worse when you have multiple independent chains that start from different initial values (as they should).
Thankfully, in our case, the purpose of our analysis is not to draw inference on the distributions of the mixture components, but on the entire mixture. Our assessment of convergence may still be frustrated, but our interests are label-agnostic.
G <- 6L                 # number of mixture components
N <- ncol(y)
R <- diag(rep(1,3))     # Wishart scale matrix
nu <- 5                 # Wishart degrees of freedom
zero <- rep(0, 3)
alpha <- sapply(1:G, function(g) 1.5^(G-g))  # decaying Dirichlet concentrations
PES <- sum(alpha)       # prior effective sample size of the Dirichlet
list(alpha = alpha, PES = PES, p = signif(alpha / PES, digits = 5))
data <- list(y = y,
G = G,
N = N,
R = R,
nu = nu,
zero = zero,
alpha = alpha)
monitor <- c("RevCholWishart", "CholInvWishart", "pi")
library(runjags)
system.time(
probColSamples1chain <- run.jags(
satelliteCode,
data = data, monitor = monitor, n.chains = 1,
burnin = 4000, sample = 10000
)
)
summary(probColSamples1chain)
probcol_df <- as.data.frame(probColSamples1chain$mcmc[[1]])
c(
ESS_rcw111 = effective_sample_size(probcol_df[["RevCholWishart[1,1,1]"]]),
ESS_rcw211 = effective_sample_size(probcol_df[["RevCholWishart[2,1,1]"]]),
ESS_rcw311 = effective_sample_size(probcol_df[["RevCholWishart[3,1,1]"]]),
ESS_rcw221 = effective_sample_size(probcol_df[["RevCholWishart[2,2,1]"]]),
ESS_rcw321 = effective_sample_size(probcol_df[["RevCholWishart[3,2,1]"]]),
ESS_rcw331 = effective_sample_size(probcol_df[["RevCholWishart[3,3,1]"]])
)
Took 138 seconds, cool. We’ll take a look at some of these samples later to assess how well it fits.
Regardless, we have to analyze tens of thousands of such data sets. Even analyzing them in parallel, it would still take a very long time.
Can we write a more efficient sampler? Of course!
We can split our parameters into three blocks:
- The (Inverse-)Wisharts $(\boldsymbol{\Sigma})\ \boldsymbol{\Lambda}$. Conditioning on the group membership of each $\textbf{y}_n$, the (Inverse-)Wisharts are conjugate with the normal distribution on $\textbf{y}_n$.
- The proportion belonging to each group, $\boldsymbol{\pi}$. Conditioning on the number of observations in each group, the Dirichlet prior is conjugate to the categorical distribution.
- Group memberships. Conditioning on the proportions and the (Inverse-)Wisharts of each group, we can calculate the likelihood of each $\textbf{y}_n$ belonging to each group, and categorical distributions are straightforward to sample from (a sketch of this update follows the list).
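Here is a minimal sketch of that third update, assuming current mixing weights rho and a list Sigma_g of per-group covariance matrices (both names are placeholders), with densities from mvtnorm::dmvnorm:
library(mvtnorm)
sample_groups <- function(y, rho, Sigma_g){
  G <- length(rho)
  apply(y, 2, function(yn){
    # Unnormalized posterior probability of each group for this observation;
    # sample.int() normalizes the weights internally.
    w <- sapply(1:G, function(g) rho[g] * dmvnorm(yn, sigma = Sigma_g[[g]]))
    sample.int(G, size = 1, prob = w)
  })
}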
Further, we can also simplify the model by marginalizing out the (Inverse-)Wisharts, and replacing the multivariate normal with a multivariate t distribution.
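That replacement rests on the standard conjugacy result that an Inverse-Wishart scale mixture of zero-mean normals is a multivariate t; in the parameterization where $\boldsymbol{\Sigma}$ has an Inverse-Wishart distribution with scale matrix $\boldsymbol{\Psi}$ and $\nu$ degrees of freedom,
\begin{align*}
\boldsymbol{\Sigma} &\sim \text{Inv-Wishart}\left(\nu, \boldsymbol{\Psi}\right), \quad \textbf{y}|\boldsymbol{\Sigma} \sim \mathcal{N}_p\left(\textbf{0}, \boldsymbol{\Sigma}\right)\\
\implies \textbf{y} &\sim t_{\nu - p + 1}\left(\textbf{0}, \tfrac{1}{\nu - p + 1}\boldsymbol{\Psi}\right).
\end{align*}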
So, let’s write our own and see how it does.