Probabilistic Inference Using Markov Chain Monte Carlo Methods

Probabilistic inference is an attractive approach to uncertain reasoning and empirical learning in artificial intelligence. Computational difficulties arise, however, because probabilistic models with the necessary realism and flexibility lead to complex distributions over high-dimensional spaces.

Radford M. Neal
Technical Report CRG-TR-93-1
Department of Computer Science
University of Toronto
E-mail: [email protected]
25 September 1993
Copyright © 1993 by Radford M. Neal

Abstract

Probabilistic inference is an attractive approach to uncertain reasoning and empirical learning in artificial intelligence. Computational difficulties arise, however, because probabilistic models with the necessary realism and flexibility lead to complex distributions over high-dimensional spaces.

Related problems in other fields have been tackled using Monte Carlo methods based on sampling using Markov chains, providing a rich array of techniques that can be applied to problems in artificial intelligence. The "Metropolis algorithm" has been used to solve difficult problems in statistical physics for over forty years, and, in the last few years, the related method of "Gibbs sampling" has been applied to problems of statistical inference. Concurrently, an alternative method for solving problems in statistical physics by means of dynamical simulation has been developed as well, and has recently been unified with the Metropolis algorithm to produce the "hybrid Monte Carlo" method. In computer science, Markov chain sampling is the basis of the heuristic optimization technique of "simulated annealing", and has recently been used in randomized algorithms for approximate counting of large sets.

In this review, I outline the role of probabilistic inference in artificial intelligence, present the theory of Markov chains, and describe various Markov chain Monte Carlo algorithms, along with a number of supporting techniques. I try to present a comprehensive picture of the range of methods that have been developed, including techniques from the varied literature that have not yet seen wide application in artificial intelligence, but which appear relevant. As illustrative examples, I use the problems of probabilistic inference in expert systems, discovery of latent classes from data, and Bayesian learning for neural networks.

Acknowledgements

I thank David MacKay, Richard Mann, Chris Williams, and the members of my Ph.D. committee, Geoffrey Hinton, Rudi Mathon, Demetri Terzopoulos, and Rob Tibshirani, for their helpful comments on this review. This work was supported by the Natural Sciences and Engineering Research Council of Canada and by the Ontario Information Technology Research Centre.

Contents

1. Introduction
2. Probabilistic Inference for Artificial Intelligence
   2.1 Probabilistic inference with a fully-specified model
   2.2 Statistical inference for model parameters
   2.3 Bayesian model comparison
   2.4 Statistical physics
3. Background on the Problem and its Solution
   3.1 Definition of the problem
   3.2 Approaches to solving the problem
   3.3 Theory of Markov chains
4. The Metropolis and Gibbs Sampling Algorithms
   4.1 Gibbs sampling
   4.2 The Metropolis algorithm
   4.3 Variations on the Metropolis algorithm
   4.4 Analysis of the Metropolis and Gibbs sampling algorithms
5. The Dynamical and Hybrid Monte Carlo Methods
   5.1 The stochastic dynamics method
   5.2 The hybrid Monte Carlo algorithm
   5.3 Other dynamical methods
   5.4 Analysis of the hybrid Monte Carlo algorithm
6. Extensions and Refinements
   6.1 Simulated annealing
   6.2 Free energy estimation
   6.3 Error assessment and reduction
   6.4 Parallel implementation
7. Directions for Research
   7.1 Improvements in the algorithms
   7.2 Scope for applications
8. Annotated Bibliography

Index to Examples

Type of model: Definition; Statistical Inference; Gibbs Sampling; Metropolis Algorithm; Stochastic Dynamics; Hybrid Monte Carlo

Gaussian distribution: 9; 15, 19; 64; 83, 84
Latent class model: 10; 21; 51; POS; POS; POS
Belief network: 10; *; 50; POS; NA; NA
Multi-layer perceptron: 12; 16, 19, 22, 35; INF; 58; 77; 81
2D Ising model: 26; NA; 49; 57; NA; NA
Lennard-Jonesium: 27; NA; INF; 57; 76; POS

NA: Not applicable. INF: Probably infeasible. POS: Possible, but not discussed.

* Statistical inference for the parameters of belief networks is quite possible, but this review deals only with inference for the values of discrete variables in the network.

1. Introduction

Probability is a well-understood method of representing uncertain knowledge and reasoning to uncertain conclusions. It is applicable to low-level tasks such as perception, and to high-level tasks such as planning.
In the Bayesian framework, learning the probabilistic models needed for such tasks from empirical data is also considered a problem of probabilistic inference, in a larger space that encompasses various possible models and their parameter values.

To tackle the complex problems that arise in artificial intelligence, flexible methods for formulating models are needed. Techniques that have been found useful include the specification of dependencies using "belief networks", approximation of functions using "neural networks", the introduction of unobservable "latent variables", and the hierarchical formulation of models using "hyperparameters".

Such flexible models come with a price, however. The probability distributions they give rise to can be very complex, with probabilities varying greatly over a high-dimensional space. There may be no way to usefully characterize such distributions analytically. Often, however, a sample of points drawn from such a distribution can provide a satisfactory picture of it.

In particular, from such a sample we can obtain Monte Carlo estimates for the expectations of various functions of the variables. Suppose $X = \{X_1, \ldots, X_n\}$ is the set of random variables that characterize the situation being modeled, taking on values usually written as $x_1, \ldots, x_n$, or some typographical variation thereon. These variables might, for example, represent parameters of the model, hidden features of the objects modeled, or features of objects that may be observed in the future. The expectation of a function $a(X_1, \ldots, X_n)$, its average value with respect to the distribution over $X$, can be approximated by

\begin{align}
\langle a \rangle &= \sum_{\tilde{x}_1} \cdots \sum_{\tilde{x}_n} a(\tilde{x}_1, \ldots, \tilde{x}_n)\, P(X_1 = \tilde{x}_1, \ldots, X_n = \tilde{x}_n) \tag{1.1} \\
&\approx \frac{1}{N} \sum_{t=0}^{N-1} a(x_1^{(t)}, \ldots, x_n^{(t)}) \tag{1.2}
\end{align}

where $x_1^{(t)}, \ldots, x_n^{(t)}$ are the values for the $t$-th point in a sample of size $N$. (As above, I will often distinguish variables in summations using tildes.)
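As a small illustration of the relationship between Equations (1.1) and (1.2), the following Python sketch computes an expectation both ways for a toy distribution. The distribution (three independent binary variables with made-up probabilities), the function `a`, and all names here are assumptions chosen for illustration; they do not come from the report. Independence is assumed only so that points can be drawn directly, which is rarely possible for the complex distributions the review is concerned with.

```python
import itertools
import random

# Hypothetical toy model: three independent binary variables,
# with P(X_i = 1) given by these made-up probabilities.
PROBS = (0.2, 0.5, 0.9)


def joint_prob(x, probs=PROBS):
    """P(X_1 = x_1, ..., X_n = x_n) for the toy model."""
    prob = 1.0
    for xi, pi in zip(x, probs):
        prob *= pi if xi == 1 else 1.0 - pi
    return prob


def a(x):
    """Example function of the variables: the number of ones."""
    return sum(x)


def exact_expectation(probs=PROBS):
    """Equation (1.1): sum a(x) P(x) over every joint configuration."""
    configs = itertools.product((0, 1), repeat=len(probs))
    return sum(a(x) * joint_prob(x, probs) for x in configs)


def monte_carlo_expectation(num_samples, probs=PROBS, seed=0):
    """Equation (1.2): average a(x) over N points drawn from the distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = tuple(1 if rng.random() < pi else 0 for pi in probs)
        total += a(x)
    return total / num_samples


if __name__ == "__main__":
    print("exact      :", exact_expectation())
    print("Monte Carlo:", monte_carlo_expectation(20000))
```

The exact sum visits all 2^n configurations, so its cost grows exponentially with n, whereas the cost of the sample average depends only on N and on how hard it is to draw each point.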
Problems of prediction and decision can generally be formulated in terms of finding such expectations.

Generating samples from the complex distributions encountered in artificial intelligence applications is often not easy, however. Typically, most of the probability is concentrated in regions whose volume is a tiny fraction of the total. To generate points drawn from the distribution with reasonable efficiency, the sampling procedure must search for these relevant regions. It must do so, moreover, in a fashion that does not bias the results.

Sampling methods based on Markov chains incorporate the required search aspect in a framework where it can be proved that the correct distribution is generated, at least in the limit as the length of the chain grows. Writing $X^{(t)} = \{X_1^{(t)}, \ldots, X_n^{(t)}\}$ for the set of variables at step $t$, the chain is de
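To make the idea of a Markov chain that searches for the high-probability region concrete, here is a minimal Python sketch of a Metropolis-style sampler, the algorithm named in the abstract. Everything specific in it is an assumption for illustration: the target (an unnormalized distribution on 20 states arranged in a ring, sharply peaked near state 14), the nearest-neighbour proposal, the starting state, and the number of initial steps discarded. It is not the report's own formulation, which is developed in later sections.

```python
import math
import random

NUM_STATES = 20  # hypothetical state space: 0, 1, ..., 19 arranged in a ring


def unnormalized_p(x):
    """Illustrative unnormalized target, concentrated near x = 14."""
    return math.exp(-0.5 * (x - 14) ** 2)


def exact_mean():
    """Mean of the normalized target, by direct enumeration."""
    weights = [unnormalized_p(x) for x in range(NUM_STATES)]
    total = sum(weights)
    return sum(x * w for x, w in enumerate(weights)) / total


def metropolis_chain(num_steps, start=0, seed=0):
    """Simulate the chain and return the state visited at each step."""
    rng = random.Random(seed)
    x = start
    states = []
    for _ in range(num_steps):
        # Symmetric proposal: step to one of the two neighbouring states.
        candidate = (x + rng.choice((-1, 1))) % NUM_STATES
        # Accept with probability min(1, p(candidate) / p(x));
        # only the ratio is needed, so p need not be normalized.
        if rng.random() < unnormalized_p(candidate) / unnormalized_p(x):
            x = candidate
        states.append(x)
    return states


if __name__ == "__main__":
    states = metropolis_chain(50000)
    kept = states[1000:]  # discard early steps taken far from the peak
    print("chain estimate:", sum(kept) / len(kept))
    print("exact mean    :", exact_mean())
```

The chain starts at state 0, where the target probability is negligible, and drifts toward the peak because moves to more probable states are always accepted. Averaging the retained states then gives an estimate of the form of Equation (1.2), although successive points are dependent rather than independent.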
