13.4 Metropolis Sampling

Metropolis sampling is a technique for generating a set of samples from a non-negative function $f$ that is distributed proportionally to $f$'s value (Metropolis et al. 1953). Remarkably, it is able to do so without requiring anything more than the ability to evaluate $f$; it is not necessary to be able to integrate $f$, normalize the integral, and invert the resulting CDF. Furthermore, every iteration produces a usable sample generated from the function's PDF; Metropolis sampling doesn't share the shortcoming of rejection sampling that the number of iterations needed to obtain the next sample cannot be bounded. It can thus efficiently generate samples from a wider variety of functions than the techniques introduced in the previous section. It forms the foundation of the Metropolis light transport algorithm implemented in Section 16.4.

Metropolis sampling does have a few disadvantages: successive samples in the sequence are statistically correlated, and it is thus not possible to ensure that a small set of samples generated by Metropolis is well distributed across the domain. It’s only in the limit over a large number of samples that the samples will cover the domain. As such, the variance reduction advantages of techniques like stratified sampling (Section 13.8.1) are generally not available when using Metropolis sampling.

13.4.1 Basic Algorithm

More concretely, the Metropolis algorithm generates a set of samples $X_i$ from a function $f$, which is defined over an arbitrary-dimensional state space $\Omega$ (frequently, $\Omega = \mathbb{R}^n$) and returns a value in the reals, $f: \Omega \to \mathbb{R}$. After the first sample $X_0 \in \Omega$ has been selected, each subsequent sample $X_i$ is generated by applying a random mutation to $X_{i-1}$ to compute a proposed sample $X'$. The mutation may be accepted or rejected, and $X_i$ is accordingly set to either $X'$ or $X_{i-1}$. When these transitions from one state to another are chosen subject to a few requirements (to be described shortly), the distribution of $X_i$ values that results reaches an equilibrium distribution; this distribution is the stationary distribution. In the limit, the distribution of the set of samples $X_i \in \Omega$ is proportional to $f(x)$'s probability density function $p(x) = f(x) / \int_\Omega f(x)\,\mathrm{d}\Omega$.

In order to generate the correct distribution of samples, it is necessary to generate proposed mutations and then accept or reject the mutations subject to a few constraints. Assume that we have a mutation method that proposes changing from a given state $X$ into a new state $X'$ (this might be done by perturbing $X$ in some way, or even by generating a completely new value). We must be able to compute a tentative transition function $T(X \to X')$ that gives the probability density of the mutation technique's proposing a transition to $X'$, given that the current state is $X$. (Section 13.4.2 will discuss considerations for designing transition functions.)

Given a transition function, it is possible to define an acceptance probability $a(X \to X')$ that gives the probability of accepting a proposed mutation from $X$ to $X'$ in a way that ensures that the distribution of samples is proportional to $f(x)$. If the distribution is already in equilibrium, the transition density between any two states must be equal:

$$ f(X)\,T(X \to X')\,a(X \to X') = f(X')\,T(X' \to X)\,a(X' \to X). \tag{13.6} $$

This property is called detailed balance.

Since $f$ and $T$ are set, Equation (13.6) tells us how $a$ must be defined. In particular, a definition of $a$ that maximizes the rate at which equilibrium is reached is

$$ a(X \to X') = \min\left(1, \frac{f(X')\,T(X' \to X)}{f(X)\,T(X \to X')}\right). \tag{13.7} $$

One thing to immediately notice from Equation (13.7) is that, if the transition probability density is the same in both directions, the acceptance probability simplifies to

$$ a(X \to X') = \min\left(1, \frac{f(X')}{f(X)}\right). \tag{13.8} $$

Put together, we have the basic Metropolis sampling algorithm in pseudocode:

    X = X0
    for i = 1 to n
        X' = mutate(X)
        a = accept(X, X')
        if (random() < a)
            X = X'
        record(X)

This code generates $n$ samples by mutating the previous sample and computing acceptance probabilities as in Equation (13.7). Each sample $X_i$ can then be recorded in a data structure or used as a sample for integration.
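
To make this concrete, here is a minimal C++ sketch of the basic loop for a one-dimensional state space. It is not pbrt source code; the target function f(), the uniform proposal, and the fixed iteration count are placeholder choices for illustration. Because the proposal is symmetric, the acceptance probability takes the simplified form of Equation (13.8).

    #include <algorithm>
    #include <random>
    #include <vector>

    // A minimal sketch of the basic Metropolis loop over a 1D state space.
    // The proposal resamples X' uniformly, which is symmetric, so the
    // acceptance probability simplifies to Equation (13.8).
    double f(double x) { return (x - 0.5) * (x - 0.5); }  // placeholder target

    int main() {
        std::mt19937 rng(0);
        std::uniform_real_distribution<double> u(0., 1.);

        double X = u(rng);  // X0; see Section 13.4.3 on start-up bias
        std::vector<double> samples;
        const int n = 10000;
        for (int i = 1; i <= n; ++i) {
            double Xp = u(rng);  // X' = mutate(X): uniform resampling
            // a = accept(X, X'), Equation (13.8); guard against f(X) == 0
            double a = (f(X) > 0) ? std::min(1., f(Xp) / f(X)) : 1.;
            if (u(rng) < a) X = Xp;
            samples.push_back(X);  // record(X)
        }
        return 0;
    }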

Because the Metropolis algorithm naturally avoids parts of $\Omega$ where $f(x)$'s value is relatively low, few samples will be accumulated there. In order to get some information about $f(x)$'s behavior in such regions, the expected values technique can be used to enhance the basic Metropolis algorithm. In this case, we still decide which state to transition into as before, but we record a sample at each of $X$ and $X'$, regardless of which one is selected by the acceptance criterion. Each of these recorded samples has a weight associated with it, where the weights are the probabilities $(1 - a)$ for $X$ and $a$ for $X'$, where $a$ is the acceptance probability. Expected values doesn't change the way we decide which state, $X$ or $X'$, to use at the next step; that part of the computation remains the same.

Updated pseudocode shows the idea:

    X = X0
    for i = 1 to n
        X' = mutate(X)
        a = accept(X, X')
        record(X, 1 - a)
        record(X', a)
        if (random() < a)
            X = X'

Comparing the two pieces of pseudocode, we can see that, in the limit, the same weight distribution will be accumulated for $X$ and $X'$. Expected values gives a smoother result more quickly and provides more information about the areas where $f(x)$ is low than the basic algorithm does.
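
A sketch of how the recording step changes under expected values follows; record() here is a hypothetical callback that accumulates a weighted sample, and a symmetric mutation is assumed so that Equation (13.8) applies.

    #include <algorithm>
    #include <functional>
    #include <random>

    // Sketch of the expected values variant: both X and X' are recorded at
    // every iteration with weights (1 - a) and a; the transition decision
    // itself is unchanged from the basic algorithm.
    void MetropolisExpectedValues(
            const std::function<double(double)> &f,
            const std::function<double(double, std::mt19937 &)> &mutate,
            const std::function<void(double, double)> &record,  // (sample, weight)
            double X0, int n, std::mt19937 &rng) {
        std::uniform_real_distribution<double> u(0., 1.);
        double X = X0;
        for (int i = 1; i <= n; ++i) {
            double Xp = mutate(X, rng);  // assumes a symmetric mutation
            double a = (f(X) > 0) ? std::min(1., f(Xp) / f(X)) : 1.;
            record(X, 1 - a);
            record(Xp, a);
            if (u(rng) < a) X = Xp;
        }
    }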

13.4.2 Choosing Mutation Strategies

In general, one has a lot of freedom in choosing mutation strategies, subject to being able to compute the tentative transition density $T(X \to X')$. Recall from Equation (13.8) that if the transition densities are symmetric then it is not even necessary to be able to compute them to apply the Metropolis sampling algorithm. It's easy to apply multiple mutation strategies, so if there are some that are effective in only some circumstances it doesn't hurt to try using them as one of a set of approaches.

It is generally desirable that mutations propose large changes to the current sample rather than small ones. Doing so more quickly explores the state space rather than letting the sampler get stuck in a small region of it. However, when the function's value $f(X)$ is relatively large at the current sample $X$, then it is likely that many proposed mutations will be rejected (consider the case where $f(X) \gg f(X')$ in Equation (13.8); $a(X \to X')$ will be very small). We'd like to avoid the case where many samples in a row are the same, again to better explore new parts of the state space: staying in the same state for many samples in a row leads to increased variance; intuitively, it makes sense that the more we move around $\Omega$, the better the overall results will be. For this case, small mutations are likely to propose samples $X'$ where $f$ is still relatively large, leading to higher acceptance probabilities.

Thus, one useful mutation approach is to apply random perturbations to the current sample $X$. If the sample $X$ is represented by a vector of real numbers $(x_0, x_1, \ldots)$, then some or all of the sample dimensions $x_i$ can be perturbed. One possibility is to perturb them by adding or subtracting a scaled random variable:

$$ x_i' = (x_i \pm s\,\xi) \bmod 1 $$

for some scale factor $s$ and where the $\bmod$ operator wraps values around the boundaries to remain in $[0, 1)$. This method is symmetric, so we don't need to compute the transition densities $T(X \to X')$ when using it with Metropolis sampling.
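
A sketch of such a perturbation for a sample stored as a vector of values in $[0, 1)$ follows; the scale $s$ and the per-dimension random sign choice are illustrative assumptions rather than anything prescribed above.

    #include <cmath>
    #include <random>
    #include <vector>

    // Sketch of the wrap-around perturbation: each dimension is offset by
    // +/- s*xi and wrapped back into [0, 1). The forward and reverse proposals
    // have equal density, so the transition terms cancel in Equation (13.7).
    std::vector<double> Perturb(const std::vector<double> &x, double s,
                                std::mt19937 &rng) {
        std::uniform_real_distribution<double> u(0., 1.);
        std::vector<double> xp(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            double offset = s * u(rng);
            if (u(rng) < 0.5) offset = -offset;  // +/- with equal probability
            xp[i] = x[i] + offset;
            xp[i] -= std::floor(xp[i]);          // wrap into [0, 1)
        }
        return xp;
    }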

A related mutation approach is to just discard the current sample entirely and generate a new one with uniform random numbers:

$$ x_i' = \xi_i. $$

(Note that this is also a symmetric method.) Occasionally generating a completely new sample in this manner is important since it ensures that we don't get stuck in one part of the state space and never sample the rest of it. In general, it's necessary that it be possible to reach all states $X \in \Omega$ where $f(X) > 0$ with nonzero probability (this property is called ergodicity). In particular, to ensure ergodicity it suffices that $T(X \to X') > 0$ for all $X$ and $X'$ where $f(X) > 0$ and $f(X') > 0$.

Another approach is to use PDFs that match some part of the function being sampled. If we have a PDF $p(x)$ that is similar to some component of $f$, then we can use it to derive a mutation strategy by simply drawing new samples $X' \sim p$. In this case, the transition function is straightforward:

$$ T(X \to X') = p(X'). $$

In other words, the current state $X$ doesn't matter for computing the transition density: we propose a transition into a state $X'$ with a density that depends only on the newly proposed state $X'$ and not at all on the current state.
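
Substituting this transition function into Equation (13.7) gives the corresponding acceptance probability; note that if $p$ were exactly proportional to $f$, every proposal would be accepted.

$$ a(X \to X') = \min\left(1, \frac{f(X')\,p(X)}{f(X)\,p(X')}\right). $$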

13.4.3 Start-up Bias

One issue that we've sidestepped thus far is how the initial sample $X_0$ is computed. The transition and acceptance methods above tell us how to generate new samples $X_{i+1}$, but all presuppose that the current sample $X_i$ has itself already been sampled with probability proportional to $f$. Using a sample not from $f$'s distribution leads to a problem called start-up bias.

A common solution to this problem is to run the Metropolis sampling algorithm for some number of iterations from an arbitrary starting state, discard the samples that are generated, and then start the process for real, assuming that this has brought us to an appropriately sampled $X$ value. This is unsatisfying for two reasons: first, the expense of taking the samples that are then discarded may be high, and, second, we can only guess at how many initial samples must be taken in order to remove start-up bias.

An alternative approach can be used if another sampling method is available: an initial value $X_0$ is sampled using any density function, $X_0 \sim p(x)$. We start the Markov chain from the state $X_0$, but we weight the contributions of all of the samples that we generate by the weight

$$ w = \frac{f(X_0)}{p(X_0)}. $$

This method eliminates start-up bias completely and does so in a predictable manner.

The only potential problem comes if $f(X_0) = 0$ for the $X_0$ we chose; in this case, all samples will have a weight of zero! This doesn't mean that the algorithm is biased, however; the expected value of the result still converges to the correct distribution (see Veach (1997) for further discussion and for a proof of the correctness). To reduce variance and avoid this risk, we can instead sample a set of $N$ candidate sample values, $Y_1, \ldots, Y_N$, defining a weight for each by

$$ w_i = \frac{f(Y_i)}{p(Y_i)}. $$

We then choose the starting $X_0$ sample for the Metropolis algorithm from the $Y_i$ with probability proportional to their relative weights and compute a sample weight $w$ as the average of all of the $w_i$ weights. All subsequent samples $X_i$ that are generated by the Metropolis algorithm are then weighted by the sample weight $w$.
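
A sketch of this candidate-based start-up procedure in C++ follows; f, p, and samplep are placeholders for the target function, the auxiliary density, and a routine that draws samples from it.

    #include <numeric>
    #include <random>
    #include <vector>

    // Sketch of start-up bias elimination with N candidates: draw Y_1..Y_N
    // from p, weight each by f(Y_i)/p(Y_i), pick X0 with probability
    // proportional to its weight, and return the average weight, which
    // scales every subsequent Metropolis sample.
    struct StartupSample { double X0, w; };

    StartupSample ChooseStartupSample(
            int N, std::mt19937 &rng,
            double (*f)(double), double (*p)(double),
            double (*samplep)(std::mt19937 &)) {
        std::vector<double> Y(N), w(N);
        for (int i = 0; i < N; ++i) {
            Y[i] = samplep(rng);
            w[i] = f(Y[i]) / p(Y[i]);
        }
        // Select X0 in proportion to the weights (assumes at least one w_i > 0).
        std::discrete_distribution<int> pick(w.begin(), w.end());
        return { Y[pick(rng)], std::accumulate(w.begin(), w.end(), 0.) / N };
    }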

13.4.4 1D Setting

In order to illustrate some of the ideas in this section, we'll show how Metropolis sampling can be used to sample a simple 1D function, defined over $\Omega = [0, 1]$ and zero everywhere else (see Figure 13.7):

$$ f(x) = \begin{cases} (x - 0.5)^2 & 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases} \tag{13.9} $$

For this example, assume that we don't actually know the exact form of $f$; it's just a black box that we can evaluate at particular $x$ values. (Clearly, if we knew that $f$ was just Equation (13.9), there'd be no need for Metropolis in order to draw samples from its distribution!)

Figure 13.7: Graph of the function $f(x)$ used to illustrate Metropolis sampling in this section.

We’ll define two mutation strategies based on the ideas introduced in Section 13.4.2 and randomly choose among them each time a mutation is proposed, according to a desired distribution of how frequently each is to be used.

The first strategy, $\mathrm{mutate}_1$, discards the current sample $X$ and uniformly samples a new one, $X'$, from the entire state space $[0, 1]$. The transition function for this mutation is straightforward: since we are sampling uniformly over $[0, 1]$, the probability density is constant over the entire domain, and in this case that density is just one everywhere. We have

$$ \mathrm{mutate}_1(X) \to \xi $$
$$ T_1(X \to X') = 1. $$

The second mutation adds a random offset between $\pm 0.05$ to the current sample $X$ in an effort to sample repeatedly in the parts of $f$ that make a high contribution to the overall distribution. The transition probability density is 0 if $X$ and $X'$ are far enough away that $\mathrm{mutate}_2$ will never mutate from one to the other; otherwise, the density is constant. Normalizing the density so that it integrates to 1 over its domain gives the value $1/0.1$. Both this and $\mathrm{mutate}_1$ are symmetric, so the transition densities aren't needed to implement the sampling algorithm.

$$ \mathrm{mutate}_2(X) \to X + 0.1\,(\xi - 0.5) $$
$$ T_2(X \to X') = \begin{cases} \frac{1}{0.1} & |X - X'| \le 0.05 \\ 0 & \text{otherwise.} \end{cases} $$

To find the initial sample, we only need to take a single sample with a uniform PDF over $\Omega$, since $f(x) > 0$ except for a single point in $\Omega$, for which there is zero probability of sampling:

$$ X_0 = \xi. $$

The sample weight $w$ is then just $f(X_0)$.

We can now run the Metropolis algorithm and generate samples $X_i$ of $f$. At each transition, we have two weighted samples to record (recall the expected values pseudocode from Section 13.4.1). A simple approach for reconstructing the approximation to $f$'s probability distribution is to store sums of the weights in a set of buckets of uniform width; each sample falls in a single bucket and contributes to it. Figure 13.8 shows some results. For both graphs, a chain of 10,000 mutations was followed, with the sample weights accumulated in 50 buckets over $[0, 1]$.
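
The following self-contained C++ program, a sketch rather than pbrt source, puts these pieces together: $X_0$ sampled uniformly, a 1:9 mix of the two mutations (matching the second experiment described below), expected values recording, and 50 buckets of accumulated weights.

    #include <algorithm>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Sketch of the 1D experiment: Metropolis sampling of f(x) = (x - 0.5)^2
    // on [0, 1] using mutate1 with probability 0.1 and mutate2 with
    // probability 0.9, expected values recording, and 50 uniform-width
    // buckets of sample weights.
    double f(double x) { return (x >= 0 && x <= 1) ? (x - 0.5) * (x - 0.5) : 0; }

    int main() {
        std::mt19937 rng(0);
        std::uniform_real_distribution<double> u(0., 1.);
        const int nBuckets = 50, nIterations = 10000;
        std::vector<double> buckets(nBuckets, 0.);
        auto record = [&](double x, double wt) {
            if (x >= 0 && x < 1) buckets[int(x * nBuckets)] += wt;
        };

        double X = u(rng);  // X0 sampled uniformly, so its weight is just f(X0)
        double w = f(X);
        for (int i = 0; i < nIterations; ++i) {
            // Choose mutate1 (uniform resampling) 10% of the time, mutate2 90%.
            double Xp = (u(rng) < 0.1) ? u(rng) : X + 0.1 * (u(rng) - 0.5);
            double a = (f(X) > 0) ? std::min(1., f(Xp) / f(X)) : 1.;
            record(X, w * (1 - a));  // expected values: record both states
            record(Xp, w * a);
            if (u(rng) < a) X = Xp;
        }
        // Print bucket centers and accumulated weights for plotting.
        for (int i = 0; i < nBuckets; ++i)
            printf("%f %f\n", (i + 0.5) / nBuckets, buckets[i]);
        return 0;
    }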

In the first graph, only $\mathrm{mutate}_1$ was used. This alone isn't a very effective mutation, since it doesn't take advantage of the times when it has found a sample in a region of $\Omega$ where $f$ has a relatively large value to generate additional samples in that neighborhood. However, the graph does suggest that the algorithm is converging to the correct distribution.

In the second, one of $\mathrm{mutate}_1$ and $\mathrm{mutate}_2$ was randomly chosen, with probabilities of 10% and 90%, respectively. We see that for the same number of samples taken, we converge to $f$'s distribution with less variance. This is because the algorithm is more effectively able to concentrate its work in areas where $f$'s value is large, proposing fewer mutations to parts of the state space where $f$'s value is low. For example, if $X = 0.8$ and the second mutation proposes $X' = 0.75$, this proposal will be accepted $f(0.75)/f(0.8) \approx 69\%$ of the time, while mutations from $0.75$ to $0.8$ will be accepted $\min(1, 1.44) = 100\%$ of the time. Thus, we see how the algorithm naturally tends to avoid spending time sampling around the dip in the middle of the curve.
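
These acceptance probabilities follow directly from Equation (13.8) and the definition of $f$ in Equation (13.9):

$$ \frac{f(0.75)}{f(0.8)} = \frac{(0.25)^2}{(0.3)^2} = \frac{0.0625}{0.09} \approx 0.694, \qquad \min\!\left(1, \frac{f(0.8)}{f(0.75)}\right) = \min(1, 1.44) = 1. $$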

Figure 13.8: Comparison of Metropolis Sampling Strategies. (a) Only $\mathrm{mutate}_1$ was used, randomly selecting a completely new value at each step. (b) Both $\mathrm{mutate}_1$ and $\mathrm{mutate}_2$ were used, with a 1:9 ratio. For the same number of samples, variance is substantially lower, thanks to a higher rate of acceptance of $\mathrm{mutate}_2$'s proposed mutations.

One important thing to note about these graphs is that the $y$ axis has units that are different from those in Figure 13.7, where $f$ is graphed. Recall that Metropolis sampling provides a set of samples distributed according to $f$'s probability density; as such (for example), we would get the same sample distribution for another function $g = 2f$. If we wish to reconstruct an approximation to $f$ directly from Metropolis samples, we must compute a normalization factor and use it to scale the PDF.

Figure 13.9 shows the surprising result of using only $\mathrm{mutate}_2$ to propose sample values. In Figure 13.9(a), 10,000 samples were taken using just that mutation. Clearly, things have gone awry: no samples $X_i > 0.5$ were generated, and the result doesn't bear much resemblance to $f$.

Figure 13.9: Why It Is Important to Periodically Pick a Completely New Sample Value with Metropolis Sampling. (a) 10,000 iterations using only $\mathrm{mutate}_2$ were computed, (b) 300,000 iterations. It is very unlikely that a series of mutations will be able to move from one side of the curve, across $0.5$, to the other side, since mutations that move toward areas where $f$'s value is low will usually be rejected. As such, the results are inaccurate for these numbers of iterations. (It's small solace that they would be correct in the limit.)

Thinking about the acceptance probability again, we can see that it would take a large number of mutations, each with low probability of acceptance, to move $X_i$ down close enough to $0.5$ that $\mathrm{mutate}_2$'s short span would be enough to move over to the other side. Since the Metropolis algorithm tends to stay away from the lower-valued regions of $f$ (recall the comparison of probabilities for moving from $0.8$ to $0.75$ versus moving from $0.75$ to $0.8$), this happens quite rarely. Figure 13.9(b) shows what happens if 300,000 samples are taken. This was enough to be able to jump from one side of $0.5$ to the other a few times, but not enough to get close to the correct distribution. Using $\mathrm{mutate}_2$ alone is thus not mathematically incorrect, just inefficient: it does have nonzero probability of proposing a transition from any state to any other state (through a chain of multiple transitions).

13.4.5 Estimating Integrals with Metropolis Sampling

We can apply the Metropolis algorithm to estimating integrals such as $\int f(x)\,g(x)\,\mathrm{d}\Omega$. Doing so is the basis of the Metropolis light transport algorithm implemented in Section 16.4.

To see how the samples from Metropolis sampling can be used in this way, recall that the standard Monte Carlo estimator, Equation (13.3), says that

$$ \int_\Omega f(x)\,g(x)\,\mathrm{d}\Omega \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(X_i)\,g(X_i)}{p(X_i)}, $$

where $X_i$ are sampled from a density function $p(x)$. Thus, if we apply Metropolis sampling and generate a set of samples, $X_1, \ldots, X_N$, from a density function that is proportional to $f(x)$, then we can estimate this integral as

$$ \int_\Omega f(x)\,g(x)\,\mathrm{d}\Omega \approx \left[\frac{1}{N} \sum_{i=1}^{N} g(X_i)\right] \cdot \int_\Omega f(x)\,\mathrm{d}\Omega. $$
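
To see why, recall from Section 13.4.1 that the Metropolis samples are distributed according to $p(x) = f(x)/\int_\Omega f(x)\,\mathrm{d}\Omega$; substituting this density into the standard estimator causes the $f(X_i)$ factors to cancel, leaving

$$ \frac{f(X_i)\,g(X_i)}{p(X_i)} = \frac{f(X_i)\,g(X_i)}{f(X_i)\big/\int_\Omega f(x)\,\mathrm{d}\Omega} = g(X_i)\int_\Omega f(x)\,\mathrm{d}\Omega. $$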