13.4 Metropolis Sampling
Metropolis sampling is a technique for generating a set of samples $X_i$ from a non-negative function $f$ that is distributed proportionally to $f$'s value (Metropolis et al. 1953). Remarkably, it is able to do so without requiring anything more than the ability to evaluate $f$; it is not necessary to be able to integrate $f$, normalize the integral, and invert the resulting CDF. Furthermore, every iteration produces a usable sample generated from the function's PDF; Metropolis sampling doesn't share the shortcoming of rejection sampling that the number of iterations needed to obtain the next sample cannot be bounded. It can thus efficiently generate samples from a wider variety of functions than the techniques introduced in the previous section. It forms the foundation of the Metropolis light transport algorithm implemented in Section 16.4.
Metropolis sampling does have a few disadvantages: successive samples in the sequence are statistically correlated, and it is thus not possible to ensure that a small set of samples generated by Metropolis is well distributed across the domain. It’s only in the limit over a large number of samples that the samples will cover the domain. As such, the variance reduction advantages of techniques like stratified sampling (Section 13.8.1) are generally not available when using Metropolis sampling.
13.4.1 Basic Algorithm
More concretely, the Metropolis algorithm generates a set of samples $X_i$ from a function $f$, which is defined over an arbitrary-dimensional state space $\Omega$ (frequently, $\Omega = \mathbb{R}^n$) and returns a value in the reals, $f: \Omega \to \mathbb{R}$. After the first sample $X_0$ has been selected, each subsequent sample $X_i$ is generated by using a random mutation to $X_{i-1}$ to compute a proposed sample $X'$. The mutation may be accepted or rejected, and $X_i$ is accordingly set to either $X'$ or $X_{i-1}$. When these transitions from one state to another are chosen subject to a few requirements (to be described shortly), the distribution of $X_i$ values that results reaches an equilibrium distribution; this distribution is the stationary distribution. In the limit, the distribution of the set of samples $X_i$ is proportional to $f(x)$'s probability density function $p(x) = f(x) / \int_\Omega f(x)\, d\Omega$.
In order to generate the correct distribution of samples, it is necessary to generate proposed mutations and then accept or reject the mutations subject to a few constraints. Assume that we have a mutation method that proposes changing from a given state $X$ into a new state $X'$ (this might be done by perturbing $X$ in some way, or even by generating a completely new value). We must be able to compute a tentative transition function $T(X \to X')$ that gives the probability density of the mutation technique's proposing a transition to $X'$, given that the current state is $X$. (Section 13.4.2 will discuss considerations for designing transition functions.)
Given a transition function, it is possible to define an acceptance probability $a(X \to X')$ that gives the probability of accepting a proposed mutation from $X$ to $X'$ in a way that ensures that the distribution of samples is proportional to $f(x)$. If the distribution is already in equilibrium, the transition density between any two states must be equal:

$$ f(X)\, T(X \to X')\, a(X \to X') = f(X')\, T(X' \to X)\, a(X' \to X). \qquad (13.6) $$
This property is called detailed balance.
Since $f$ and $T$ are set, Equation (13.6) tells us how $a$ must be defined. In particular, a definition of $a$ that maximizes the rate at which equilibrium is reached is

$$ a(X \to X') = \min\left(1, \frac{f(X')\, T(X' \to X)}{f(X)\, T(X \to X')}\right). \qquad (13.7) $$
One thing to immediately notice from Equation (13.7) is that, if the transition probability density is the same in both directions, the acceptance probability simplifies to

$$ a(X \to X') = \min\left(1, \frac{f(X')}{f(X)}\right). \qquad (13.8) $$
Put together, we have the basic Metropolis sampling algorithm in pseudocode:
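    X = X0
    for i = 1 to n
        X' = mutate(X)
        a = accept(X, X')
        if (random() < a)
            X = X'
        record(X)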
This code generates $n$ samples by mutating the previous sample and computing acceptance probabilities as in Equation (13.7). Each sample $X_i$ can then be recorded in a data structure or used as a sample for integration.
Because the Metropolis algorithm naturally avoids parts of $\Omega$ where $f(x)$'s value is relatively low, few samples will be accumulated there. In order to get some information about $f(x)$'s behavior in such regions, the expected values technique can be used to enhance the basic Metropolis algorithm. In this case, we still decide which state to transition into as before, but we record a sample at each of $X$ and $X'$, regardless of which one is selected by the acceptance criteria. Each of these recorded samples has a weight associated with it, where the weights are the probabilities $1 - a$ for $X$ and $a$ for $X'$, where $a$ is the acceptance probability. Expected values doesn't change the way we decide which state, $X$ or $X'$, to use at the next step; that part of the computation remains the same.
Updated pseudocode shows the idea:
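    X = X0
    for i = 1 to n
        X' = mutate(X)
        a = accept(X, X')
        record(X, 1 - a)
        record(X', a)
        if (random() < a)
            X = X'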
Comparing the two pieces of pseudocode, we can see that in the limit, the same weight distribution will be accumulated for $X$ and $X'$. Expected values gives a smoother result and more information about the areas where $f$ is low, more quickly than the basic algorithm does.
13.4.2 Choosing Mutation Strategies
In general, one has a lot of freedom in choosing mutation strategies, subject to being able to compute the tentative transition density $T(X \to X')$. Recall from Equation (13.8) that if the transition densities are symmetric then it is not even necessary to be able to compute them to apply the Metropolis sampling algorithm. It's easy to apply multiple mutation strategies, so if there are some that are effective in only some circumstances it doesn't hurt to try using them as one of a set of approaches.
It is generally desirable that mutations propose large changes to the current sample rather than small ones. Doing so more quickly explores the state space rather than letting the sampler get stuck in a small region of it. However, when the function's value $f(X)$ is relatively large at the current sample $X$, then it is likely that many proposed mutations will be rejected (consider the case where $f(X) \gg f(X')$ in Equation (13.8); $a(X \to X')$ will be very small). We'd like to avoid the case where many samples in a row are the same, again to better explore new parts of the state space: staying in the same state for many samples in a row leads to increased variance—intuitively, it makes sense that the more we move around $\Omega$, the better the overall results will be. For this case, small mutations are likely to propose samples $X'$ where $f$ is still relatively large, leading to higher acceptance probabilities.
Thus, one useful mutation approach is to apply random perturbations to the current sample $X$. If the sample $X$ is represented by a vector of real numbers $(x_0, x_1, \ldots)$, then some or all of the sample dimensions $x_i$ can be perturbed. One possibility is to perturb them by adding or subtracting a scaled random variable:

$$ x_i' = (x_i \pm s\,\xi) \bmod 1, $$

for some scale factor $s$ and where the $\bmod$ operator wraps values around the boundaries to remain in $[0, 1)$. This method is symmetric, so we don't need to compute the transition densities when using it with Metropolis sampling.
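A minimal C++ sketch of this perturbation (an illustration with names of our own choosing, assuming a standard Mersenne Twister generator):

    #include <cmath>
    #include <random>

    // Perturb one sample dimension by adding or subtracting s*xi, wrapping
    // around the [0,1) boundaries as in the equation above.
    double Perturb(double x, double s, std::mt19937 &gen) {
        std::uniform_real_distribution<double> uni(0., 1.);
        double xi = uni(gen);
        double xNew = (uni(gen) < 0.5) ? x + s * xi : x - s * xi;
        return xNew - std::floor(xNew);  // "mod 1": wraps into [0,1)
    }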
A related mutation approach is to just discard the current sample entirely and generate a new one with uniform random numbers:

$$ x_i' = \xi_i. $$
(Note that this is also a symmetric method.) Occasionally generating a completely new sample in this manner is important since it ensures that we don't get stuck in one part of the state space and never sample the rest of it. In general, it's necessary that it be possible to reach all states $X \in \Omega$ where $f(X) > 0$ with nonzero probability (this property is called ergodicity). In particular, to ensure ergodicity it suffices that $T(X \to X') > 0$ for all $X$ and $X'$ where $f(X) > 0$ and $f(X') > 0$.
Another approach is to use PDFs that match some part of the function being sampled. If we have a PDF $p(x)$ that is similar to some component of $f$, then we can use that to derive a mutation strategy by just drawing new samples $X' \sim p$. In this case, the transition function is straightforward:

$$ T(X \to X') = p(X'). $$
In other words, the current state $X$ doesn't matter for computing the transition density: we propose a transition into a state $X'$ with a density that depends only on the newly proposed state and not at all on the current state.
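In code, the acceptance probability of Equation (13.7) for such a PDF-driven proposal is easy to compute, since the two transition densities are just $p(X')$ and $p(X)$. The following C++ sketch is one way to write it (the function names and signatures are ours, for illustration):

    #include <algorithm>

    // Acceptance probability of Equation (13.7) when proposals are drawn
    // from a PDF p, so that T(x -> x') = p(x'). f and p are callables
    // evaluating the target function and the proposal density.
    template <typename F, typename P>
    double AcceptFromPDF(F f, P p, double x, double xNew) {
        double num = f(xNew) * p(x);     // f(X') T(X' -> X)
        double den = f(x) * p(xNew);     // f(X)  T(X -> X')
        if (den == 0) return 1;          // always accept when leaving a zero-valued state
        return std::min(1.0, num / den);
    }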
13.4.3 Start-up Bias
One issue that we've sidestepped thus far is how the initial sample $X_0$ is computed. The transition and acceptance methods above tell us how to generate new samples $X_{i+1}$, but all presuppose that the current sample $X_i$ has itself already been sampled with probability proportional to $f$. Using a sample not from $f$'s distribution leads to a problem called start-up bias.
A common solution to this problem is to run the Metropolis sampling algorithm for some number of iterations from an arbitrary starting state, discard the samples that are generated, and then start the process for real, assuming that this process has brought us to an appropriately sampled $X$ value. This is unsatisfying for two reasons: first, the expense of taking the samples that were then discarded may be high, and, second, we can only guess at how many initial samples must be taken in order to remove start-up bias.
An alternative approach can be used if another sampling method is available: an initial value $X_0$ is sampled using any density function $p(x)$, $X_0 \sim p$. We start the Markov chain from the state $X_0$, but we weight the contributions of all of the samples that we generate by the weight

$$ w = \frac{f(X_0)}{p(X_0)}. $$
This method eliminates start-up bias completely and does so in a predictable manner.
The only potential problem comes if $f(X_0) = 0$ for the $X_0$ we chose; in this case, all samples will have a weight of zero! This doesn't mean that the algorithm is biased, however; the expected value of the result still converges to the correct distribution (see Veach (1997) for further discussion and for a proof of the correctness). To reduce variance and avoid this risk, we can instead sample a set of $N$ candidate sample values, $Y_1, \ldots, Y_N$, defining a weight for each by

$$ w_i = \frac{f(Y_i)}{p(Y_i)}. $$
We then choose the starting $X_0$ sample for the Metropolis algorithm from the $Y_i$ with probability proportional to their relative weights and compute a sample weight $w$ as the average of all of the $w_i$ weights. All subsequent samples $X_i$ that are generated by the Metropolis algorithm are then weighted by the sample weight $w$.
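A C++ sketch of this bootstrapping step might look like the following (again an illustration with our own names; sample is assumed to draw $Y_i \sim p$, and f and p evaluate the target function and density):

    #include <random>
    #include <utility>
    #include <vector>

    // Choose a start-up sample from N candidates Y_i ~ p, picking Y_j with
    // probability proportional to w_j = f(Y_j)/p(Y_j), and return both the
    // chosen X0 and the sample weight w (the average of the w_i).
    template <typename F, typename P, typename S>
    std::pair<double, double> StartupSample(F f, P p, S sample, int N,
                                            std::mt19937 &gen) {
        std::vector<double> y(N), w(N);
        double wAvg = 0;
        for (int i = 0; i < N; ++i) {
            y[i] = sample(gen);          // Y_i ~ p
            w[i] = f(y[i]) / p(y[i]);    // w_i = f(Y_i) / p(Y_i)
            wAvg += w[i] / N;
        }
        // Choose X0 = Y_j with probability proportional to w_j.
        // (If every w_i were zero, a larger N or a different p would be needed.)
        std::discrete_distribution<int> pick(w.begin(), w.end());
        return { y[pick(gen)], wAvg };
    }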
13.4.4 1D Setting
In order to illustrate some of the ideas in this section, we'll show how Metropolis sampling can be used to sample a simple 1D function, defined over $x \in [0, 1]$ and 0 everywhere else (see Figure 13.7):

$$ f(x) = \begin{cases} (x - 0.5)^2 & 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (13.9) $$
For this example, assume that we don't actually know the exact form of $f$—it's just a black box that we can evaluate at particular $x$ values. (Clearly, if we knew that $f$ was just Equation (13.9), there'd be no need for Metropolis in order to draw samples from its distribution!)
We’ll define two mutation strategies based on the ideas introduced in Section 13.4.2 and randomly choose among them each time a mutation is proposed, according to a desired distribution of how frequently each is to be used.
The first strategy, $\text{mutate}_1$, discards the current sample $X$ and uniformly samples a new one, $X'$, from the entire state space $[0, 1]$. The transition function for this mutation is straightforward. For $\text{mutate}_1$, since we are uniformly sampling over $[0, 1]$, the probability density is uniform over the entire domain; in this case, the density is just one everywhere. We have

$$ \text{mutate}_1(X) \to \xi $$
$$ T_1(X \to X') = 1. $$
The second mutation adds a random offset between $\pm 0.05$ to the current sample $X$ in an effort to sample repeatedly in the parts of $f$ that make a high contribution to the overall distribution:

$$ \text{mutate}_2(X) \to X + 0.1\,\xi - 0.05. $$

The transition probability density is 0 if $X$ and $X'$ are far enough away that $\text{mutate}_2$ will never mutate from one to the other; otherwise, the density is constant. Normalizing the density so that it integrates to 1 over its domain gives the value $1/0.1$:

$$ T_2(X \to X') = \begin{cases} \frac{1}{0.1} & |X - X'| \le 0.05 \\ 0 & \text{otherwise.} \end{cases} $$

Both this and $\text{mutate}_1$ are symmetric, so the transition densities aren't needed to implement the sampling algorithm.
To find the initial sample, we only need to take a single sample with a uniform PDF over $\Omega$, since $f(x) > 0$ except for a single point in $\Omega$ for which there is zero probability of sampling:

$$ X_0 = \xi. $$
The sample weight $w$ is then just $f(X_0)$.
We can now run the Metropolis algorithm and generate samples $X_i$ of $f$. At each transition, we have two weighted samples to record (recall the expected values pseudocode from Section 13.4.1). A simple approach for reconstructing the approximation to $f$'s probability distribution is to store sums of the weights in a set of buckets of uniform width; each sample falls in a single bucket and contributes to it. Figure 13.8 shows some results. For both graphs, a chain of 10,000 mutations was followed, with the sample weights accumulated in 50 buckets over $[0, 1]$.
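The complete experiment is compact enough to sketch in full. The following C++ program is one possible implementation of the procedure just described—a single chain, both mutations chosen with the 10%/90% mix discussed below, expected-values weighting, and 50 uniform buckets—offered as a sketch rather than the code used to generate the book's figures:

    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        std::mt19937 gen(0);
        std::uniform_real_distribution<double> uni(0., 1.);
        // The black-box function of Equation (13.9).
        auto f = [](double x) {
            return (x >= 0 && x <= 1) ? (x - 0.5) * (x - 0.5) : 0.;
        };

        const int nBuckets = 50, nMutations = 10000;
        std::vector<double> buckets(nBuckets, 0.);
        auto record = [&](double x, double weight) {
            int b = std::min(int(x * nBuckets), nBuckets - 1);
            buckets[b] += weight;
        };

        // Start-up: X0 is uniform over [0,1], so p = 1 and w = f(X0).
        double X = uni(gen);
        double w = f(X);

        for (int i = 0; i < nMutations; ++i) {
            // Choose mutate_1 10% of the time and mutate_2 90% of the time.
            double Xnew = (uni(gen) < 0.1)
                              ? uni(gen)                    // mutate_1
                              : X + 0.1 * uni(gen) - 0.05;  // mutate_2
            // Both mutations are symmetric, so Equation (13.8) applies.
            double a = (f(X) > 0) ? std::min(1., f(Xnew) / f(X)) : 1.;
            // Expected values: record both samples, weighted by 1-a and a.
            record(X, w * (1 - a));
            if (Xnew >= 0 && Xnew <= 1) record(Xnew, w * a);
            if (uni(gen) < a) X = Xnew;
        }
        for (int i = 0; i < nBuckets; ++i)
            printf("%f %f\n", (i + 0.5) / nBuckets, buckets[i]);
        return 0;
    }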
In the first graph, only $\text{mutate}_1$ was used. This alone isn't a very effective mutation, since it doesn't take advantage of the times when it has found a sample in a region of $[0, 1]$ where $f$ has a relatively large value to generate additional samples in that neighborhood. However, the graph does suggest that the algorithm is converging to the correct distribution.
In the second, one of $\text{mutate}_1$ and $\text{mutate}_2$ was randomly chosen, with probabilities of 10% and 90%, respectively. We see that for the same number of samples taken, we converge to $f$'s distribution with less variance. This is because the algorithm is more effectively able to concentrate its work in areas where $f$'s value is large, proposing fewer mutations to parts of state space where $f$'s value is low. For example, if $X = 0.8$ and the second mutation proposes $X' = 0.75$, this will be accepted $f(0.75)/f(0.8) \approx 69\%$ of the time, while mutations from $0.75$ to $0.8$ will be accepted $\min(1, f(0.8)/f(0.75)) = 100\%$ of the time. Thus, we see how the algorithm naturally tends to try to avoid spending time sampling around the dip in the middle of the curve.
One important thing to note about these graphs is that the $y$ axis has units that are different from those in Figure 13.7, where $f(x)$ is graphed. Recall that Metropolis sampling provides a set of samples distributed according to $f(x)$'s probability density; as such (for example), we would get the same sample distribution for another function $g(x) = 2 f(x)$. If we wish to reconstruct an approximation to $f$ directly from Metropolis samples, we must compute a normalization factor and use it to scale the PDF.
Figure 13.9 shows the surprising result of using only $\text{mutate}_2$ to propose sample values. In the first graph, 10,000 samples were taken using just that mutation. Clearly, things have gone awry—no samples $X_i > 0.5$ were generated and the result doesn't bear much resemblance to $f$.
Thinking about the acceptance probability again, we can see that it would take a large number of mutations, each with low probability of acceptance, to move $X_i$ down close enough to $0.5$ such that $\text{mutate}_2$'s short span would be enough to move over to the other side. Since the Metropolis algorithm tends to stay away from the lower-valued regions of $f$ (recall the comparison of acceptance probabilities for moving from $0.8$ to $0.75$ versus moving from $0.75$ to $0.8$), this happens quite rarely. The second graph of Figure 13.9 shows what happens if 300,000 samples are taken. This was enough to be able to jump from one side of $0.5$ to the other a few times, but not enough to get close to the correct distribution. Using $\text{mutate}_2$ alone is thus not mathematically incorrect, just inefficient: it does have nonzero probability of proposing a transition from any state to any other state (through a chain of multiple transitions).
13.4.5 Estimating Integrals with Metropolis Sampling
We can apply the Metropolis algorithm to estimating integrals such as $\int f(x)\, g(x)\, d\Omega$. Doing so is the basis of the Metropolis light transport algorithm implemented in Section 16.4.
To see how the samples from Metropolis sampling can be used in this way, recall that the standard Monte Carlo estimator, Equation (13.3), says that

$$ \int f(x)\, g(x)\, d\Omega \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(X_i)\, g(X_i)}{p(X_i)}, $$
where $X_i$ are sampled from a density function $p(x)$. Thus, if we apply Metropolis sampling and generate a set of samples, $X_1, \ldots, X_N$, from a density function $p(x)$ that is proportional to $f(x)$, then we can estimate this integral as

$$ \int f(x)\, g(x)\, d\Omega \approx \left[ \frac{1}{N} \sum_{i=1}^{N} g(X_i) \right] \int f(x)\, d\Omega. $$
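For the 1D example of Equation (13.9), the normalization factor is known in closed form, $\int_0^1 (x - 0.5)^2\, dx = 1/12$; in general it must itself be computed, for example with a standard Monte Carlo estimator. A C++ sketch of the resulting estimator (our own illustration, assuming a nonempty set of samples generated as in the program above):

    #include <vector>

    // Estimate the integral of f*g from Metropolis samples X_i distributed
    // proportionally to f. fIntegral is the normalization constant, i.e.,
    // the integral of f over the domain (1/12 for Equation (13.9)).
    template <typename G>
    double EstimateIntegral(const std::vector<double> &samples, G g,
                            double fIntegral) {
        double sum = 0;
        for (double x : samples) sum += g(x);
        return (sum / samples.size()) * fIntegral;
    }

For example, EstimateIntegral(samples, [](double x) { return x; }, 1. / 12.) estimates $\int_0^1 f(x)\, x\, dx$.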