.. _chap_mcmc:

**********************************
Appendix: Markov Chain Monte Carlo
**********************************

Monte Carlo Methods in Bayesian Analysis
========================================

Bayesian analysis often requires integration over multiple dimensions that is 
intractable both via analytic methods or standard methods of numerical 
integration. However, it is often possible to compute these integrals by 
simulating (drawing samples) from posterior distributions. For example, 
consider the expected value of a random variable :math:`\mathbf{x}`:
   
.. math::
   E[{\bf x}] = \int {\bf x} f({\bf x}) d{\bf x}, \qquad
   {\bf x} = \{x_1,...,x_k\}

where :math:`k` (the dimension of vector :math:`x`) is perhaps very large. If 
we can produce a reasonable number of random vectors :math:`\{{\bf x_i}\}`, we 
can use these values to approximate the unknown integral. This process is known 
as *Monte Carlo integration*. In general, MC integration allows integrals 
against probability density functions:
   
.. math::
   I = \int h(\mathbf{x}) f(\mathbf{x}) \mathbf{dx}

to be estimated by finite sums:
   
.. math::
   \hat{I} = \frac{1}{n}\sum_{i=1}^n h(\mathbf{x}_i),

where :math:`\mathbf{x}_i` is a sample from :math:`f`. This estimate is valid 
and useful because:

* By the strong law of large numbers:

.. math::
     \hat{I} \rightarrow I   \mbox{   with probability 1}
     
* Simulation error can be measured and controlled:

     
.. math::
  :nowrap:

     \begin{equation}Var(\hat{I}) = \frac{1}{n(n-1)}\sum_{i=1}^n 
     (h(\mathbf{x}_i)-\hat{I})^2\end{equation}

Why is this relevant to Bayesian analysis? If we replace :math:`f(\mathbf{x})` 
with a posterior, :math:`f(\theta|d)` and make :math:`h(\theta)` an interesting 
function of the unknown parameter, the resulting expectation is that of the 
posterior of :math:`h(\theta)`:
   
.. math::
  :nowrap:

   \begin{equation}
   E[h(\theta)|d] = \int f(\theta|d) h(\theta) d\theta \approx \frac{1}{n}\sum_{i=1}^n h(\theta)
   \end{equation}


Rejection Sampling
------------------

Though Monte Carlo integration allows us to estimate integrals that are 
unassailable by analysis and standard numerical methods, it relies on the 
ability to draw samples from the posterior distribution. For known parametric 
forms, this is not a problem; probability integral transforms or bivariate 
techniques (e.g Box-Muller method) may be used to obtain samples from uniform 
pseudo-random variates generated from a computer. Often, however, we cannot 
readily generate random values from non-standard posteriors. In such instances, 
we can use rejection sampling to generate samples.

Posit a function, :math:`f(x)` which can be evaluated for any value on the 
support of :math:`x:S_x = [A,B]`, but may not be integrable or easily sampled 
from. If we can calculate the maximum value of :math:`f(x)`, we can then define 
a rectangle that is guaranteed to contain all possible values :math:`(x,f(x))`. 
It is then trivial to generate points over the box and enumerate the values 
that fall under the curve (Figure :ref:`bound`).

.. _bound:

.. figure:: _images/reject.*
   :alt: Rejection figure.
   :align: center
   :scale: 100
   
   Rejection sampling of a bounded form. Area is estimated by the ratio of
   accepted (open squares) to total points, multiplied by the rectangle
   area.
   
.. math::
   \frac{\mbox{Points under curve}}{\mbox{Points generated}} \times \mbox{box area} = \lim_{n \to \infty} \int_A^B f(x) dx

This approach is useful, for example, in estimating the normalizing constant 
for posterior distributions.

.. _envelope:

.. figure:: _images/envelope.*
   :alt: envelope figure
   :align: center
   :scale: 100
   
   Rejection sampling of an unbounded form using an enveloping distribution.

If :math:`f(x)` has unbounded support (i.e. infinite tails), such as a Gaussian 
distribution, a bounding box is no longer appropriate. We must specify a 
majorizing (or, enveloping) function, :math:`g(x)`, which implies:
   
.. math::
   g(x) \ge  f(x) \qquad\forall x \in (-\infty,\infty)

Having done this, we can now sample :math:`{x_i}` from :math:`g(x)` and accept 
or reject each of these values based upon :math:`f(x_i)`. Specifically, for 
each draw :math:`x_i`, we also draw a uniform random variate :math:`u_i` and 
accept :math:`x_i` if :math:`u_i < f(x_i)/cg(x_i)`, where :math:`c` is a 
constant (Figure :ref:`envelope`). This approach is made more efficient by 
choosing an enveloping distribution that is "close" to the target distribution, 
thus maximizing the number of accepted points. Further improvement is gained by 
using optimized algorithms such as importance sampling which, as the name 
implies, samples more frequently from important areas of the distribution.

Rejection sampling is usually subject to declining performance as the dimension 
of the parameter space increases, so it is used less frequently than MCMC for 
evaluation of posterior distributions [Gamerman1997]_.


Markov Chains
=============

A Markov chain is a special type of *stochastic process*. The standard 
definition of a stochastic process is an ordered collection of random variables:
   
.. math::
   \{X_t:t \in T\}

where :math:`t` is frequently (but not necessarily) a time index. If we think 
of :math:`X_t` as a state :math:`X` at time :math:`t`, and invoke the following 
dependence condition on each state:
   
.. math::
   Pr(X_{t+1}=x_{t+1} | X_t=x_t, X_{t-1}=x_{t-1},\ldots,X_0=x_0) = Pr(X_{t+1}=x_{t+1} | X_t=x_t)

then the stochastic process is known as a Markov chain. This conditioning 
specifies that the future depends on the current state, but not past states. 
Thus, the Markov chain wanders about the state space, remembering only where it 
has just been in the last time step. The collection of transition probabilities 
is sometimes called a *transition matrix* when dealing with discrete states, or 
more generally, a *transition kernel*.

In the context of Markov chain Monte Carlo, it is useful to think of the 
Markovian property as "mild non-independence". MCMC allows us to indirectly 
generate independent samples from a particular posterior distribution.

Jargon-busting
--------------

Before we move on, it is important to define some general properties of Markov 
chains. They are frequently encountered in the MCMC literature, and some will 
help us decide whether MCMC is producing a useful sample from the posterior.

* *Homogeneity*: A Markov chain is homogeneous at step :math:`t` if the
  	transition probabilities are independent of time :math:`t`.

* *Irreducibility*: A Markov chain is irreducible if every state is accessible
  	in one or more steps from any other state. That is, the chain contains no
	absorbing states. This implies that there is a non-zero probability of
	eventually reaching state :math:`k` from any other state in the chain.

* *Recurrence*: States which are visited repeatedly are *recurrent*. If the
  	expected time to return to a particular state is bounded, this is known as
	*positive recurrence*, otherwise the recurrent state is *null recurrent*.
	Further, a chain is *Harris recurrent* when it visits all states :math:`X 
	\in S` infinitely often in the limit as :math:`t \to \infty`; this is an 
	important characteristic when dealing with unbounded, continuous state 
	spaces. Whenever a chain ends up in a closed, irreducible set of Harris 
	recurrent states, it stays there forever and visits every state with 
	probability one.

* *Stationarity*: A stationary Markov chain produces the same marginal
  	distribution when multiplied by the transition kernel.  Thus, if :math:`P` 	
	is some :math:`n \times n` transition matrix:

     
   .. math::
      {\bf \pi P} = {\bf \pi}
     
	for Markov chain :math:`\pi`. Thus, :math:`\pi` is no longer subscripted, 
	and is referred to as the *limiting distribution* of the chain. In MCMC, 
	the chain explores the state space according to its limiting marginal 
	distribution.

* *Ergodicity*: Ergodicity is an emergent property of Markov chains which are
  	irreducible, positive Harris recurrent and aperiodic. Ergodicity is defined 	
	as:

     
  .. math::
      \lim_{n \to \infty} Pr^{(n)}(\theta_i \rightarrow \theta_j) = \pi(\theta) \quad \forall \theta_i, \theta_j \in \Theta

	or in words, after many steps the marginal distribution of the chain is the 
	same at one step as at all other steps. This implies that our Markov chain, 
	which we recall is dependent, can generate samples that are independent if 
	we wait long enough between samples. If it means anything to you, 
	ergodicity is the analogue of the strong law of large numbers for Markov 
	chains. For example, take values :math:`\theta_{i+1},\ldots,\theta_{i+n}` 
	from a chain that has reached an ergodic state. A statistic of interest can 
	then be estimated by:

     
  .. math::
     \frac{1}{n}\sum_{j=i+1}^{i+n} h(\theta_j) \approx \int f(\theta) h(\theta) d\theta




Why MCMC Works: Reversible Markov Chains
========================================

Markov chain Monte Carlo simulates a Markov chain for which some function of
interest (*e.g.* the joint distribution of the parameters of some model) is the
unique, invariant limiting distribution. An invariant distribution with respect
to some Markov chain with transition kernel :math:`Pr(y \mid x)` implies that:

   
.. math::
   \int_x Pr(y \mid x) \pi(x) dx = \pi(y).

Invariance is guaranteed for any **reversible** Markov chain. Consider a Markov
chain in reverse sequence:
:math:`\{\theta^{(n)},\theta^{(n-1)},...,\theta^{(0)}\}`. This sequence is still
Markovian, because:

   
.. math::
   Pr(\theta^{(k)}=y \mid \theta^{(k+1)}=x,\theta^{(k+2)}=x_1,\ldots ) = Pr(\theta^{(k)}=y \mid \theta^{(k+1)}=x)

Forward and reverse transition probabilities may be related through Bayes
theorem:

.. math::
   
.. \begin{eqnarray}
.. Pr(\theta^{(k)}=y \mid \theta^{(k+1)}=x) &=& \frac{Pr(\theta^{(k+1)}=x \mid \theta^{(k)}=y) Pr(\theta^{(k)}=y)}{Pr(\theta^{(k+1)}=x)}  \\
.. &=& \frac{Pr(\theta^{(k+1)}=x \mid \theta^{(k)}=y) \pi^{(k)}(y)}{\pi^{(k+1)}(x)}
.. \end{eqnarray}

   
.. math::
   \frac{Pr(\theta^{(k+1)}=x \mid \theta^{(k)}=y) \pi^{(k)}(y)}{\pi^{(k+1)}(x)}

Though not homogeneous in general, :math:`\pi` becomes homogeneous if **Do you
ever call the stationary distribution itself homogeneous?**:

* :math:`n \rightarrow \infty`

* :math:`\pi^{(i)}=\pi` for some :math:`i < k` 

If this chain is homogeneous it is called reversible, because it satisfies the
**detailed balance equation**:

   
.. math::
   \pi(x)Pr(y \mid x) = \pi(y) Pr(x \mid y)

Reversibility is important because it has the effect of balancing movement
through the entire state space. When a Markov chain is reversible, :math:`\pi`
is the unique, invariant, stationary distribution of that chain. Hence, if
:math:`\pi` is of interest, we need only find the reversible Markov chain for
which :math:`\pi` is the limiting distribution. This is what MCMC does!


Gibbs Sampling
==============

The Gibbs sampler is the simplest and most prevalent MCMC algorithm. If a
posterior has :math:`k` parameters to be estimated, we may condition each
parameter on current values of the other :math:`k-1` parameters, and sample from
the resultant distributional form (usually easier), and repeat this operation on
the other parameters in turn. This procedure generates samples from the
posterior distribution. Note that we have now combined Markov chains
(conditional independence) and Monte Carlo techniques (estimation by simulation)
to yield Markov chain Monte Carlo.

Here is a stereotypical Gibbs sampling algorithm:

As we can see from the algorithm, each distribution is conditioned on the last
iteration of its chain values, constituting a Markov chain as advertised. The
Gibbs sampler has all of the important properties outlined in the previous
section: it is aperiodic, homogeneous and ergodic. Once the sampler converges,
all subsequent samples are from the target distribution. This convergence occurs
at a geometric rate.


#. Choose starting values for states (parameters): :math:`{\bf \theta} = [\theta_1^{(0)},\theta_2^{(0)},\ldots,\theta_k^{(0)}]`
#. Initialize counter :math:`j=1`
#. Draw the following values from each of the :math:`k` conditional distributions:

   .. math::
      :nowrap:
   
      \begin{eqnarray*}
      \theta_1^{(j)} &\sim& \pi(\theta_1 | \theta_2^{(j-1)},\theta_3^{(j-1)},\ldots,\theta_{k-1}^{(j-1)},\theta_k^{(j-1)}) \\
      \theta_2^{(j)} &\sim& \pi(\theta_2 | \theta_1^{(j)},\theta_3^{(j-1)},\ldots,\theta_{k-1}^{(j-1)},\theta_k^{(j-1)}) \\
      \theta_3^{(j)} &\sim& \pi(\theta_3 | \theta_1^{(j)},\theta_2^{(j)},\ldots,\theta_{k-1}^{(j-1)},\theta_k^{(j-1)}) \\
      \vdots \\
      \theta_{k-1}^{(j)} &\sim& \pi(\theta_{k-1} | \theta_1^{(j)},\theta_2^{(j)},\ldots,\theta_{k-2}^{(j)},\theta_k^{(j-1)}) \\
      \theta_k^{(j)} &\sim& \pi(\theta_k | \theta_1^{(j)},\theta_2^{(j)},\theta_4^{(j)},\ldots,\theta_{k-2}^{(j)},\theta_{k-1}^{(j)})
      \end{eqnarray*}

#. Increment :math:`j` and repeat until convergence occurs.


The Metropolis-Hastings Algorithm
=================================

The key to success in applying the Gibbs sampler to the estimation of Bayesian
posteriors is being able to specify the form of the complete conditionals of
:math:`{\bf \theta}`. In fact, the algorithm cannot be implemented without them.
Of course, the posterior conditionals cannot always be neatly specified. In
contrast to the Gibbs algorithm, the Metropolis-Hastings algorithm generates
candidate state transitions from an alternate distribution, and accepts or
rejects each candidate probabilistically.

Let us first consider a simple Metropolis-Hastings algorithm for a single
parameter, :math:`\theta`. We will use a standard sampling distribution,
referred to as the *proposal distribution*, to produce candidate variables
:math:`q_t(\theta^{\prime} | \theta)`. That is, the generated value,
:math:`\theta^{\prime}`, is a *possible* next value for :math:`\theta` at step
:math:`t+1`. We also need to be able to calculate the probability of moving back
to the original value from the candidate, or
:math:`q_t(\theta | \theta^{\prime})`. These probabilistic ingredients are used
to define an *acceptance ratio*:

   
.. math::
   a(\theta^{\prime},\theta) = \frac{q_t(\theta^{\prime} | \theta) \pi(\theta^{\prime})}{q_t(\theta | \theta^{\prime}) \pi(\theta)}

The value of :math:`\theta^{(t+1)}` is then determined by:

.. math::
   \theta^{(t+1)} = \left\{\begin{array}{l@{\quad \mbox{with prob.} \quad}l}\theta^{\prime} & \min(a(\theta^{\prime},\theta),1) \\ \theta^{(t)} & 1 - \min(a(\theta^{\prime},\theta),1) \end{array}\right.

This transition kernel implies that movement is not guaranteed at every step. It
only occurs if the suggested transition is likely based on the acceptance ratio.

A single iteration of the Metropolis-Hastings algorithm proceeds as follows:

The original form of the algorithm specified by Metropolis required that
:math:`q_t(\theta^{\prime} | \theta) = q_t(\theta | \theta^{\prime})`, which
reduces :math:`a(\theta^{\prime},\theta)` to
:math:`\pi(\theta^{\prime})/\pi(\theta)`, but this is not necessary. In either
case, the state moves to high-density points in the distribution with high
probability, and to low-density points with low probability. After convergence,
the Metropolis-Hastings algorithm describes the full target posterior density,
so all points are recurrent.


#. Sample :math:`\theta^{\prime}` from :math:`q(\theta^{\prime} | \theta^{(t)})`.
#. Generate a Uniform[0,1] random variate :math:`u`.
#. If :math:`a(\theta^{\prime},\theta) > u` then :math:`\theta^{(t+1)} = \theta^{\prime}`, otherwise :math:`\theta^{(t+1)} = \theta^{(t)}`.



Random-walk Metropolis-Hastings
-------------------------------

A practical implementation of the Metropolis-Hastings algorithm makes use of a
random-walk proposal. Recall that a random walk is a Markov chain that evolves
according to:



.. math::
  :nowrap:

  \begin{eqnarray*}
  \theta^{(t+1)} &=& \theta^{(t)} + \epsilon_t \\
  \epsilon_t &\sim& f(\phi)
  \end{eqnarray*}


As applied to the MCMC sampling, the random walk is used as a proposal
distribution, whereby dependent proposals are generated according to:

   
.. math::
   q(\theta^{\prime} | \theta^{(t)}) = f(\theta^{\prime} - \theta^{(t)}) = \theta^{(t)} + \epsilon_t

Generally, the density generating :math:`\epsilon_t` is symmetric about zero,
resulting in a symmetric chain. Chain symmetry implies that
:math:`q(\theta^{\prime} | \theta^{(t)}) = q(\theta^{(t)} | \theta^{\prime})`,
which reduces the Metropolis-Hastings acceptance ratio to:

   
.. math::
   a(\theta^{\prime},\theta) = \frac{\pi(\theta^{\prime})}{\pi(\theta)}

The choice of the random walk distribution for :math:`\epsilon_t` is frequently
a normal or Student's :math:`t` density, but it may be any distribution that
generates an irreducible proposal chain.

An important consideration is the specification of the scale parameter for the
random walk error distribution. Large values produce random walk steps that are
highly exploratory, but tend to produce proposal values in the tails of the
target distribution, potentially resulting in very small acceptance rates.
Conversely, small values tend to be accepted more frequently, since they tend to
produce proposals close to the current parameter value, but may result in chains
that mix very slowly. Some simulation studies suggest optimal acceptance rates
in the range of 20-50%. It is often worthwhile to optimize the proposal variance
by iteratively adjusting its value, according to observed acceptance rates early
in the MCMC simulation [Gamerman1997]_.