A.2 Reservoir Sampling
To perform the sampling operation, both SampleDiscrete() and alias tables require the number of outcomes being sampled from as well as all their probabilities to be stored in memory. Often this is not a problem, but for cases where we would like to draw a sample from a large number of events, or cases where each event requires a large amount of memory, it is useful to be able to generate samples without storing all of them at once.
A family of algorithms based on a technique called reservoir sampling makes this possible, by taking a stream of candidate samples one at a time and randomly keeping just one of them in a way that ensures that the sample that is kept is from the distribution defined by the samples that have been seen so far. Reservoir sampling algorithms date to the early days of computer tape drives, where data could only be accessed sequentially and there was often more of it than could be stored in main memory. Reservoir sampling made it possible to draw random samples from data stored on tape while only reading the tape once.
The basic reservoir sampling algorithm is easily expressed. Each candidate sample is stored in the reservoir with probability equal to one over the number of candidates that have been considered:
The correctness of this algorithm can be shown using induction. For the base case, it is clear that if there is a single sample, it will be stored in the reservoir, and the reservoir has successfully drawn a sample with the appropriate probability from the sample distribution.
Now consider the case where samples have been considered and assume that the sample stored in the reservoir has been kept with probability . When a new sample is considered, it will be kept with probability , which is clearly the correct probability for it. The existing sample is kept with probability ; the product of the probability of keeping the existing sample and its probability of being stored in the reservoir gives the correct probability, , as well.
Weighted reservoir sampling algorithms generalize the basic algorithm by making it possible to associate a nonnegative weight with each sample. Samples are then kept with probability given by the ratio of their weight to the sum of weights of all of the candidate samples that have been seen so far. The WeightedReservoirSampler class implements this algorithm. It is parameterized by the type of object being sampled T.
WeightedReservoirSampler stores an RNG object that provides the random numbers that are used in deciding whether to add each sample to the reservoir. The constructor correspondingly takes a seed value that is passed on to the RNG.
If an array of WeightedReservoirSamplers is allocated, then the default constructor runs instead. In that case, the RNGs in individual samplers can be seeded via the Seed() method.
The Add() method takes a single sample and a nonnegative weight value and updates the reservoir so that the stored sample is from the expected distribution.
The probability p for storing the sample candidate in the reservoir is easily found given weightSum.
The weight of the sample stored in the reservoir is stored in reservoirWeight; it is needed to compute the value of the probability mass function (PMF) for the sample that is kept.
A second Add() method takes a callback function that returns a sample. This function is only called when the sample is to be stored in the reservoir. This variant is useful in cases where the sample’s weight can be computed independently of its value and where its value is relatively expensive to compute. The fragment that contains its implementation, <<Process weighted reservoir sample via callback>>, otherwise follows the same structure as the first Add() method, so it is not included here.
A number of methods provide access to the sample and the probability that it was stored in the reservoir.
It is sometimes useful to reset a WeightedReservoirSampler and restart from scratch with a new stream of samples; the Reset() method handles this task.
Remarkably, it is possible to merge two reservoirs into one in such a way that the stored sample is kept with the same probability as if a single reservoir had considered all of the samples seen by the two. Merging two reservoirs is a matter of randomly taking the sample stored by the second reservoir with probability defined by its sum of sample weights divided by the sum of both reservoirs’ sums of sample weights, which in turn is exactly what the Add() method does.