Markov Chain Monte Carlo (MCMC) methods are a popular way of tackling the otherwise intractable integrals at the heart of Bayesian statistics.

Unfortunately, while popular packages for doing MCMC can be convenient, they can sometimes seem like opaque black boxes.

So, let’s write our own Gibbs samplers to have something to play with, try to understand what is going on, and get a feel for what popular packages like JAGS (Just Another Gibbs Sampler), NIMBLE, and BUGS (Bayesian inference Using Gibbs Sampling) are doing under the hood.

This code was written and run from a Jupyter notebook. You can copy and paste the code, or ask me for the ipynb file and install Jupyter notebook. Then, from within `R`, run `install.packages("IRkernel")` and `IRkernel::installspec()`. You can then launch the notebook (`jupyter notebook` from the command line) and run the file within a browser.

First, why MCMC? Integration is hard.

If you would like a Bayesian posterior, you need to solve

\begin{align}

\pi(\boldsymbol{\theta}|\textbf{y}) = \frac{ \pi(\boldsymbol{\theta})f(\textbf{y}|\boldsymbol{\theta}) }{ \int_{-\infty}^{\infty} \pi(\boldsymbol{\theta})f(\textbf{y}|\boldsymbol{\theta}) d\boldsymbol{\theta} }.

\end{align}

The integral in the denominator is normally intractable. Furthermore, if you want to answer useful questions with that posterior, you’ll often find yourself needing to solve problems like:

\begin{align}

0.025 &= \int_{-\infty}^l \int_{-\infty}^{\infty} \pi(\boldsymbol{\theta}|\textbf{y})d\boldsymbol{\theta}_{-k} d\theta_k\\

0.025 &= \int_u^{\infty} \int_{-\infty}^{\infty} \pi(\boldsymbol{\theta}|\textbf{y})d\boldsymbol{\theta}_{-k} d\theta_k

\end{align}

that is, marginalize out all parameters except $\theta_k$, and then solve for a lower bound $l$ of $\theta_k$ such that $2.5\%$ of the probability mass lies below it, and a similar upper bound, $u$.

That is a lot of integration.

We need numerical techniques. Markov Chain Monte Carlo techniques are among the best and the most flexible $-$ especially for high dimensional problems.

They let us draw samples from a posterior distribution. These samples give us an empirical cdf we can use to approximate the actual distribution. Like all numerical techniques, MCMC isn’t perfect, and the results have error. But because the results come in the form of samples describing a population (the true posterior), we as statisticians are hopefully comfortable assessing how well that sample (the MCMC samples) really describes the “population”.

Typically, we run the sampler long enough to get a big sample size and thus a good approximation.

Just make sure not to confuse the data sample and its population with the MCMC samples and their population (the true posterior). Running MCMC for longer isn’t going to help with noise from bad data.
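To make the “samples approximate integrals” point concrete, here is a toy sketch. The Beta(4, 3) posterior is just an arbitrary stand-in: empirical quantiles of draws approximate the exact bounds $l$ and $u$ that we would otherwise get by solving those integrals.

```r
# Toy example: pretend the posterior is a known Beta(4, 3).
# Empirical quantiles of the draws approximate the exact bounds l and u,
# which here we can also compute exactly with qbeta for comparison.
set.seed(1)
draws <- rbeta(1e5, 4, 3)
rbind(
  empirical = quantile(draws, c(0.025, 0.975)),
  exact     = qbeta(c(0.025, 0.975), 4, 3)
)
```

With a hundred thousand draws, the two rows agree to about three decimal places.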

So, what is Gibbs sampling?

Gibbs sampling, in a nutshell, lets you break up sampling from a complicated model into pieces. If the full parameter vector is $\boldsymbol{\theta}$, we’ll have $P$ pieces, $\boldsymbol{\theta}_p,p=1,\ldots,P$.

To sample $\boldsymbol{\theta}$, all you’ll have to do is sample from each $\boldsymbol{\theta}_p$, conditioning on the remaining parameters $\boldsymbol{\theta}_{-p}$.

We’ll use a superscript, $\boldsymbol{\theta}^i$, to indicate the MCMC sample number. So, say we have sample $\boldsymbol{\theta}^{i-1}$. Given this sample, we can draw sample $\boldsymbol{\theta}^{i}$ by:

- Sampling $\boldsymbol{\theta}_1$ conditioning on $\boldsymbol{\theta}_{-1}^{i-1}$
- Sampling $\boldsymbol{\theta}_2$ conditioning on $\boldsymbol{\theta}_1^{i},\boldsymbol{\theta}_{3,\ldots,P}^{i-1}$
- Sampling $\boldsymbol{\theta}_3$ conditioning on $\boldsymbol{\theta}_{1,2}^{i},\boldsymbol{\theta}_{4,\ldots,P}^{i-1}$
- Sampling…

etc. If sample $i-1$ was from the posterior, then the $i$th will be as well. However, the samples will *not* be iid, because each is conditioned on the previous one. That also means if $\boldsymbol{\theta}^i$ isn’t representative of the posterior $-$ for example, because it was a randomly generated initial value used to start the chain $-$ then $\boldsymbol{\theta}^{i+1}$ is not necessarily representative either. It can take a while for the chain to move to representative, “valid” values. This is one reason why it is common to discard the early iterations as a “warmup”. Another reason is that some samplers need those early iterations to adapt (tune their own parameters).
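Before the beta-binomial, here is what that recipe looks like for a hypothetical two-parameter target (purely an illustrative sketch, not part of the model we’ll build below): a bivariate standard normal with correlation $\rho$, where each full conditional is the known normal $\theta_1 | \theta_2 \sim N(\rho\theta_2, 1-\rho^2)$, and vice versa.

```r
# A minimal Gibbs sketch: alternate between the two full conditionals
# of a bivariate standard normal with correlation rho.
gibbs_bvn <- function(iter, rho, warmup = 100L) {
  x <- matrix(0, iter, 2)
  cur <- c(0, 0)
  for (i in seq_len(iter + warmup)) {
    # Each draw conditions on the most recent value of the other coordinate.
    cur[1] <- rnorm(1, rho * cur[2], sqrt(1 - rho^2))
    cur[2] <- rnorm(1, rho * cur[1], sqrt(1 - rho^2))
    if (i > warmup) x[i - warmup, ] <- cur
  }
  x
}
set.seed(1)
bvn_draws <- gibbs_bvn(1e4, 0.8)
cor(bvn_draws)[1, 2]  # should land near the target correlation of 0.8
```

Even though we only ever draw from one-dimensional normals, the collected pairs have the right joint distribution.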

Let’s take a close look at a very simple example: the beta-binomial. The beta-binomial model is:

\begin{align*}

y &\sim \text{Binomial}(n, \theta)\\

\theta &\sim \text{Beta}(\alpha, \beta).

\end{align*}

What are the distributions of $y$ and $\theta$, given $n, \alpha,$ and $\beta$?

First of all, $f(y|n,\theta)$ is simply a Binomial distribution. So to sample $y$, we sample from a Binomial$(n,\theta)$.

Slightly more difficult, we need the distribution of $\theta$ given $y$.

\begin{align*}

f(\theta | y) &\propto f(y | \theta) f(\theta)\\

&= \theta^{y}(1-\theta)^{n-y}\theta^{\alpha - 1}(1-\theta)^{\beta - 1}\\

&= \theta^{\alpha + y - 1}(1-\theta)^{n - y + \beta - 1}.

\end{align*}

Therefore, we can recognize this as the kernel of a Beta$(\alpha + y, n - y + \beta)$ distribution.
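If you don’t trust the algebra, a quick numerical check confirms the conjugacy (the values $n = 20$, $y = 12$ here are arbitrary): the unnormalized posterior divided by the claimed Beta density should be constant in $\theta$.

```r
# The unnormalized posterior f(y | theta) * f(theta), divided by the
# Beta(alpha + y, n - y + beta) density, should be constant in theta.
n <- 20; y <- 12; alpha <- 3; beta <- 2
theta_grid <- seq(0.01, 0.99, by = 0.01)
unnorm <- dbinom(y, n, theta_grid) * dbeta(theta_grid, alpha, beta)
ratio <- unnorm / dbeta(theta_grid, alpha + y, n - y + beta)
range(ratio)  # the two ends agree: the ratio is flat up to rounding error
```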

So, to draw sample $y^i, \theta^i$ given $y^{i-1},\theta^{i-1}$, we’d only need to

- sample $y^i$ from a Binomial$(n, \theta^{i-1})$.
- sample $\theta^i$ from a Beta$(\alpha + y^i, n - y^i + \beta)$.

We can implement this with a simple `for` loop in R. To start the chain, we’ll randomly initialize $\theta$ with a draw from a uniform distribution. Sampling, using $\alpha = 3, \beta = 2$:

```
rand_BetaBinomial <- function(iter, N, alpha, beta, warmup = 1000L, theta_init = runif(1)) {
  # We preallocate the theta and y vectors.
  # "Preallocate" means to create the vectors ("allocating" their memory),
  # so that we can fill them in as we go. That is much faster than repeating
  # `theta <- c(theta, new_sample)` in a for loop.
  theta <- double(iter)
  y <- integer(iter)
  # Sample our first y from the initial value of theta.
  y[1] <- rbinom(1, N, theta_init)
  # Run a for loop for "warmup" iterations. Here, we're just overwriting
  # our old values instead of saving them.
  for (i in 1:max(1L, warmup)) {
    theta[1] <- rbeta(1, y[1] + alpha, N - y[1] + beta)
    y[1] <- rbinom(1, N, theta[1])
  }
  # Now we sample. Warmup was set to be at least 1, so that we can start
  # our posterior sampling at i = 2 and use "i - 1" to reference the
  # previous iteration.
  for (i in 2:iter) {
    theta[i] <- rbeta(1, y[i - 1] + alpha, N - y[i - 1] + beta)
    y[i] <- rbinom(1, N, theta[i])
  }
  data.frame(theta = theta, y = y)
}
set.seed(1)
system.time(beta_binom <- rand_BetaBinomial(1e5, 20, 3, 2))
```

Neat. We can get a hundred thousand samples in a negligible amount of practical time.

Let’s focus on $y$, because it has the beta-binomial distribution:

```
library(tidyverse)
y_distribution <- beta_binom %>%
  select(y) %>%
  group_by(y) %>%
  tally() %>%
  mutate(MCMC = n / sum(n))
y_distribution
```

```
beta_binom %>%
  summarize(
    mean_theta = mean(theta),
    stdev_theta = sd(theta),
    mean_y = mean(y),
    stdev_y = sd(y)
  )
```

Let’s compare these frequencies with the actual pmf using `dbetabinom` from `rmutil`. That function uses the overdispersion parameterization, where

\begin{align*}

s &= \alpha + \beta\\

m &= \frac{\alpha}{\alpha + \beta}.

\end{align*}

so that, given $\alpha = 3, \beta = 2$, we have $s = 5, m = 0.6$.

```
library(rmutil)
theme_set(theme_bw())
options(width = 150)
y_summary <- y_distribution %>%
  mutate(PMF = dbetabinom(y, 20, 0.6, 5)) %>%
  gather(estimate, probability, MCMC, PMF)
ggplot(y_summary) + geom_col(aes(y, probability, fill = estimate), position = position_dodge())
```

Not bad looking.

Let’s dive into this a little more. A familiar diagnostic is effective sample size.

A commonly used formula is $n_{eff} = \frac{n}{1 + 2\sum_{k=1}^{n-1}\rho_k}$, where $\rho_k$ is the autocorrelation at lag $k$.

The function `acf` starts with lag 0, where the autocorrelation is 1 (everything is perfectly correlated with itself), so we’ll subtract it off in the formula below.

```
effective_sample_size <- function(x) length(x) / (2 * sum(acf(x, plot = FALSE)$acf) - 1)
c(
  SS = length(beta_binom$y),
  ESS_y = effective_sample_size(beta_binom$y),
  ESS_theta = effective_sample_size(beta_binom$theta)
)
```

So while we actually had a hundred thousand samples, the effective sample size is around eleven thousand.
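As a sanity check on this estimator (separate from our sampler), we can feed it an AR(1) series with known autocorrelation $\phi$, for which theory gives $n_{eff} \approx n(1-\phi)/(1+\phi)$:

```r
# Standalone check: an AR(1) series with coefficient phi has
# n_eff ≈ n * (1 - phi) / (1 + phi); the acf-based estimate should agree.
effective_sample_size <- function(x) length(x) / (2 * sum(acf(x, plot = FALSE)$acf) - 1)
set.seed(1)
phi <- 0.7
ar1 <- as.numeric(arima.sim(list(ar = phi), n = 1e5))
c(estimated = effective_sample_size(ar1),
  theory = 1e5 * (1 - phi) / (1 + phi))
```

The two numbers won’t match exactly ($\rho_k$ is estimated with noise, and `acf` truncates the lags), but they should be in the same ballpark.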

We can see the autocorrelation by picking a random slice of the samples and watching $y$ “walk”:

```
beta_binom$iter <- 1:nrow(beta_binom)
ggplot(beta_binom[300:340,]) + geom_line(aes(iter, y))
```

Let’s also take a look at convergence. Let’s pick a rubbish starting value and watch what happens early on:

```
system.time(beta_binom_converge <- rand_BetaBinomial(1e5, 20, 3, 2, warmup = 1, theta_init = 0.0001))
```

```
beta_binom_converge$iter <- 1:nrow(beta_binom_converge)
ggplot(beta_binom_converge[1:40,]) + geom_line(aes(iter, y))
```

```
ggplot(beta_binom_converge[1:40,]) + geom_line(aes(iter, theta))
```

Hopefully it is clear what is happening: the initial value of theta was close to 0, so the first value of y was also small. A small value of y leads to a small draw of theta, which leads to a small y…

But we can see it did not take long to get out of this rut. Within 10 iterations, $\theta$ and $y$ already exceeded their means of 0.6 and 12; the taint of our bad initial value is gone. This model converges extremely quickly.

I made the function default to a warmup of 1000. In retrospect, that looks extremely excessive!

Finally, let’s run the model in JAGS for good measure.

```
library(runjags)
beta_binomial_JAGS_model <- "model {
  theta ~ dbeta(alpha, beta)
  y ~ dbinom(theta, N)
}"
system.time(
  beta_binom_JAGS <- run.jags(
    beta_binomial_JAGS_model,
    data = list(N = 20, alpha = 3, beta = 2),
    monitor = c("theta", "y"),
    n.chains = 1, burnin = 500, sample = 1e5
  )
)
summary(beta_binom_JAGS)
```

Let’s update our old plot:

```
binom_jags <- data.frame(beta_binom_JAGS$mcmc[[1]])
jags_distribution <- binom_jags %>%
  select(y) %>%
  group_by(y) %>%
  tally() %>%
  mutate(
    probability = n / sum(n),
    estimate = rep("JAGS", length(n))
  )
y_summary <- rbind(y_summary, jags_distribution)
ggplot(y_summary) + geom_col(aes(y, probability, fill = estimate), position = position_dodge())
```

Snazzy. Let’s also see how JAGS’s effective sample size estimate compares to our own:

```
c(
  SS = nrow(binom_jags),
  ESS_y = effective_sample_size(binom_jags$y),
  ESS_theta = effective_sample_size(binom_jags$theta)
)
```

Clever girl! Those effective sample sizes are as big as the actual sample sizes. Let’s take a look at JAGS’s autocorrelation:

```
acf(binom_jags$theta)
```

Rather, the lack of it. Compare that to ours:

```
acf(beta_binom$theta)
```

It looks like JAGS wasn’t Gibbs sampling here at all. Instead, it recognized the $\theta$-$y$ pair as a conjugate pair that it could sample from jointly, rather than iteratively sampling between them. That sort of trick can be a boon for performance (particularly for effective sample size), especially when a model contains a lot of conjugate pairs. But it only works when the priors are conjugate, with a distribution JAGS recognizes.
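Here is what “sampling the pair jointly” amounts to in this model (my reading of what a sampler can exploit here, not something I’ve confirmed in the JAGS source): with no data conditioned on, the joint distribution factorizes, so we can draw $\theta$ from its prior and then $y$ given $\theta$, giving iid draws with no autocorrelation at all.

```r
# Independent draws from the joint: theta from its prior, then y | theta.
# Each (theta, y) pair is a fresh draw, so there is nothing to "walk" from.
set.seed(1)
theta_iid <- rbeta(1e5, 3, 2)
y_iid <- rbinom(1e5, 20, theta_iid)
acf(y_iid, plot = FALSE)$acf[2]  # lag-1 autocorrelation, essentially zero
```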

`rmutil` of course lets us sample from the beta-binomial as well, though it only gives us the `y`. Doing that of course gives us zero autocorrelation as well:

```
system.time(acf(rbetabinom(1e5, 20, 0.6, 5)))
```

Note how much faster JAGS was for the same number of samples, with the bonus of returning theta as well $-$ and while being far more flexible. JAGS is a solid piece of software.

NIMBLE is probably similar, with the advantage of letting you define your own distributions and samplers, in case performance is important and you happen to know something NIMBLE does not $-$ which could often be the case when you get creative.