Probabilistic Inference Using Markov Chain Monte Carlo Methods
Probabilistic inference is an attractive approach to uncertain reasoning and empirical
learning in artificial intelligence. Computational difficulties arise, however,
because probabilistic models with the necessary realism and
exibility lead to complex
distributions over high-dimensional spaces.
Probabilistic Inference Using
Markov Chain Monte Carlo Methods
Radford M. Neal
Technical Report CRG-TR-93-1
Department of Computer Science
University of Toronto
E-mail: [email protected]
25 September 1993
Copyright 1993 by Radford M. Neal
c
Abstract
Probabilistic inference is an attractive approach to uncertain reasoning and em-
pirical learning in articial intelligence. Computational diculties arise, however,
because probabilistic models with the necessary realism and exibility lead to com-
plex distributions over high-dimensional spaces.
Related problems in other elds have been tackled using Monte Carlo methods based
on sampling using Markov chains, providing a rich array of techniques that can be
applied to problems in articial intelligence. The \Metropolis algorithm" has been
used to solve dicult problems in statistical physics for over forty years, and, in the
last few years, the related method of \Gibbs sampling" has been applied to problems
of statistical inference. Concurrently, an alternative method for solving problems
in statistical physics by means of dynamical simulation has been developed as well,
and has recently been unied with the Metropolis algorithm to produce the \hybrid
Monte Carlo" method. In computer science, Markov chain sampling is the basis
of the heuristic optimization technique of \simulated annealing", and has recently
been used in randomized algorithms for approximate counting of large sets.
In this review, I outline the role of probabilistic inference in articial intelligence,
present the theory of Markov chains, and describe various Markov chain Monte
Carlo algorithms, along with a number of supporting techniques. I try to present a
comprehensive picture of the range of methods that have been developed, including
techniques from the varied literature that have not yet seen wide application in
articial intelligence, but which appear relevant. As illustrative examples, I use the
problems of probabilistic inference in expert systems, discovery of latent classes from
data, and Bayesian learning for neural networks.
Acknowledgements
I thank David MacKay, Richard Mann, Chris Williams, and the members of my
Ph.D committee, Georey Hinton, Rudi Mathon, Demetri Terzopoulos, and Rob
Tibshirani, for their helpful comments on this review. This work was supported
by the Natural Sciences and Engineering Research Council of Canada and by the
Ontario Information Technology Research Centre.
Contents
1. Introduction 1
2. Probabilistic Inference for Articial Intelligence 4
2.1 Probabilistic inference with a fully-specied model : : : : : : : : : : : : : : : 5
2.2 Statistical inference for model parameters : : : : : : : : : : : : : : : : : : : : 13
2.3 Bayesian model comparison : : : : : : : : : : : : : : : : : : : : : : : : : : : : 23
2.4 Statistical physics : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 25
3. Background on the Problem and its Solution 30
3.1 Denition of the problem : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 30
3.2 Approaches to solving the problem : : : : : : : : : : : : : : : : : : : : : : : : 32
3.3 Theory of Markov chains : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 36
4. The Metropolis and Gibbs Sampling Algorithms 47
4.1 Gibbs sampling : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 47
4.2 The Metropolis algorithm : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 54
4.3 Variations on the Metropolis algorithm : : : : : : : : : : : : : : : : : : : : : : 59
4.4 Analysis of the Metropolis and Gibbs sampling algorithms : : : : : : : : : : : 64
5. The Dynamical and Hybrid Monte Carlo Methods 70
5.1 The stochastic dynamics method : : : : : : : : : : : : : : : : : : : : : : : : : 70
5.2 The hybrid Monte Carlo algorithm : : : : : : : : : : : : : : : : : : : : : : : : 77
5.3 Other dynamical methods : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 81
5.4 Analysis of the hybrid Monte Carlo algorithm : : : : : : : : : : : : : : : : : : 83
6. Extensions and Renements 87
6.1 Simulated annealing : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 87
6.2 Free energy estimation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 94
6.3 Error assessment and reduction : : : : : : : : : : : : : : : : : : : : : : : : : : 102
6.4 Parallel implementation : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 114
7. Directions for Research 116
7.1 Improvements in the algorithms : : : : : : : : : : : : : : : : : : : : : : : : : : 116
7.2 Scope for applications : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 118
8. Annotated Bibliography 121
Index to Examples
De- Statistical Gibbs Metropolis Stochastic Hybrid
Type of model nition Inference Sampling Algorithm Dynamics Monte Carlo
Gaussian distribution 9 15, 19 64 83, 84
Latent class model 10 21 51 POS POS POS
Belief network 10 * 50 POS NA NA
Multi-layer perceptron 12 16, 19, 22, 35 INF 58 77 81
2D Ising model 26 NA 49 57 NA NA
Lennard-Jonesium 27 NA INF 57 76 POS
NA { Not applicable, INF { Probably infeasible, POS { Possible, but not discussed
Statistical inference for the parameters of belief networks is quite possible, but this
review deals only with inference for the values of discrete variables in the network.
1. Introduction
Probability is a well-understood method of representing uncertain knowledge and reasoning
to uncertain conclusions. It is applicable to low-level tasks such as perception, and to high-
level tasks such as planning. In the Bayesian framework, learning the probabilistic models
needed for such tasks from empirical data is also considered a problem of probabilistic in-
ference, in a larger space that encompasses various possible models and their parameter
values. To tackle the complex problems that arise in articial intelligence, exible meth-
ods for formulating models are needed. Techniques that have been found useful include
the specication of dependencies using \belief networks", approximation of functions using
\neural networks", the introduction of unobservable \latent variables", and the hierarchical
formulation of models using \hyperparameters".
Such exible models come with a price however. The probability distributions they give rise
to can be very complex, with probabilities varying greatly over a high-dimensional space.
There may be no way to usefully characterize such distributions analytically. Often, however,
a sample of points drawn from such a distribution can provide a satisfactory picture of it.
In particular, from such a sample we can obtain Monte Carlo estimates for the expectations
of various functions of the variables. Suppose X = fX1; . . . ; Xn g is the set of random
variables that characterize the situation being modeled, taking on values usually written as
x1; . . . ; xn, or some typographical variation thereon. These variables might, for example,
represent parameters of the model, hidden features of the objects modeled, or features of
objects that may be observed in the future. The expectation of a function a(X1 ; . . . ; Xn )
| it's average value with respect to the distribution over X | can be approximated by
X X
hai = a(~1 ; . . . ; xn) P (X1 = x1; . . . ; Xn = xn)
x ~ ~ ~ (1.1)
x1
~ xn
~
1N 1
X
N a(x(1t); . . . ; x(nt)) (1.2)
t=0
where x(t) ; . . . ; x(t) are the values for the t-th point in a sample of size N . (As above, I will
1 n
often distinguish variables in summations using tildes.) Problems of prediction and decision
can generally be formulated in terms of nding such expectations.
Generating samples from the complex distributions encountered in articial intelligence
applications is often not easy, however. Typically, most of the probability is concentrated
in regions whose volume is a tiny fraction of the total. To generate points drawn from
the distribution with reasonable eciency, the sampling procedure must search for these
relevant regions. It must do so, moreover, in a fashion that does not bias the results.
Sampling methods based on Markov chains incorporate the required search aspect in a
framework where it can be proved that the correct distribution is generated, at least in
the limit as the length of the chain grows. Writing X (t) = fX1t) ; . . . ; Xnt) g for the set of
( (
variables at step t, the chain is de