Reservoir Sampling

A.2 Reservoir Sampling

To perform the sampling operation, both SampleDiscrete() and alias tables require the number of outcomes being sampled from as well as all their probabilities to be stored in memory. Often this is not a problem, but for cases where we would like to draw a sample from a large number of events, or cases where each event requires a large amount of memory, it is useful to be able to generate samples without storing all of them at once.

A family of algorithms based on a technique called reservoir sampling makes this possible, by taking a stream of candidate samples one at a time and randomly keeping just one of them in a way that ensures that the sample that is kept is from the distribution defined by the samples that have been seen so far. Reservoir sampling algorithms date to the early days of computer tape drives, where data could only be accessed sequentially and there was often more of it than could be stored in main memory. Reservoir sampling made it possible to draw random samples from data stored on tape while only reading the tape once.

The basic reservoir sampling algorithm is easily expressed. Each candidate sample is stored in the reservoir with probability equal to one over the number of candidates that have been considered:

The correctness of this algorithm can be shown using induction. For the base case, it is clear that if there is a single sample, it will be stored in the reservoir, and the reservoir has successfully drawn a sample with the appropriate probability from the sample distribution.

Now consider the case where samples have been considered and assume that the sample stored in the reservoir has been kept with probability . When a new sample is considered, it will be kept with probability , which is clearly the correct probability for it. The existing sample is kept with probability ; the product of the probability of keeping the existing sample and its probability of being stored in the reservoir gives the correct probability, , as well.

Weighted reservoir sampling algorithms generalize the basic algorithm by making it possible to associate a nonnegative weight with each sample. Samples are then kept with probability given by the ratio of their weight to the sum of weights of all of the candidate samples that have been seen so far. The WeightedReservoirSampler class implements this algorithm. It is parameterized by the type of object being sampled T.

<<WeightedReservoirSampler Definition>>=

template <typename T> class WeightedReservoirSampler { public: <<WeightedReservoirSampler Public Methods>>

WeightedReservoirSampler() = default; WeightedReservoirSampler(uint64_t rngSeed) : rng(rngSeed) {} void Seed(uint64_t seed) { rng.SetSequence(seed); } void Add(const T &sample, Float weight) { weightSum += weight; <<Randomly add sample to reservoir>>

Float p = weight / weightSum; if (rng.Uniform<Float>() < p) { reservoir = sample; reservoirWeight = weight; }

} template <typename F> void Add(F func, Float weight) { <<Process weighted reservoir sample via callback>>

weightSum += weight; Float p = weight / weightSum; if (rng.Uniform<Float>() < p) { reservoir = func(); reservoirWeight = weight; }

} void Copy(const WeightedReservoirSampler &wrs) { weightSum = wrs.weightSum; reservoir = wrs.reservoir; reservoirWeight = wrs.reservoirWeight; } int HasSample() const { return weightSum > 0; } const T &GetSample() const { return reservoir; } Float SampleProbability() const { return reservoirWeight / weightSum; } Float WeightSum() const { return weightSum; } void Reset() { reservoirWeight = weightSum = 0; } void Merge(const WeightedReservoirSampler &wrs) { if (wrs.HasSample()) Add(wrs.reservoir, wrs.weightSum); } std::string ToString() const { return StringPrintf("[ WeightedReservoirSampler rng: %s " "weightSum: %f reservoir: %s reservoirWeight: %f ]", rng, weightSum, reservoir, reservoirWeight); }

private: <<WeightedReservoirSampler Private Members>>

RNG rng; Float weightSum = 0; Float reservoirWeight = 0; T reservoir;

};

WeightedReservoirSampler stores an RNG object that provides the random numbers that are used in deciding whether to add each sample to the reservoir. The constructor correspondingly takes a seed value that is passed on to the RNG.

<<WeightedReservoirSampler Public Methods>>=

WeightedReservoirSampler() = default; WeightedReservoirSampler(uint64_t rngSeed) : rng(rngSeed) {}

<<WeightedReservoirSampler Private Members>>=

RNG rng;

If an array of WeightedReservoirSamplers is allocated, then the default constructor runs instead. In that case, the RNGs in individual samplers can be seeded via the Seed() method.

<<WeightedReservoirSampler Public Methods>>+=

void Seed(uint64_t seed) { rng.SetSequence(seed); }

The Add() method takes a single sample and a nonnegative weight value and updates the reservoir so that the stored sample is from the expected distribution.

<<WeightedReservoirSampler Public Methods>>+=

void Add(const T &sample, Float weight) { weightSum += weight; <<Randomly add sample to reservoir>>

Float p = weight / weightSum; if (rng.Uniform<Float>() < p) { reservoir = sample; reservoirWeight = weight; }

}

<<WeightedReservoirSampler Private Members>>+=

Float weightSum = 0;

The probability p for storing the sample candidate in the reservoir is easily found given weightSum.

<<Randomly add sample to reservoir>>=

Float p = weight / weightSum; if (rng.Uniform<Float>() < p) { reservoir = sample; reservoirWeight = weight; }

The weight of the sample stored in the reservoir is stored in reservoirWeight; it is needed to compute the value of the probability mass function (PMF) for the sample that is kept.

<<WeightedReservoirSampler Private Members>>+=

Float reservoirWeight = 0; T reservoir;

A second Add() method takes a callback function that returns a sample. This function is only called when the sample is to be stored in the reservoir. This variant is useful in cases where the sample’s weight can be computed independently of its value and where its value is relatively expensive to compute. The fragment that contains its implementation, <<Process weighted reservoir sample via callback>>, otherwise follows the same structure as the first Add() method, so it is not included here.

<<WeightedReservoirSampler Public Methods>>+=

template <typename F> void Add(F func, Float weight) { <<Process weighted reservoir sample via callback>>

weightSum += weight; Float p = weight / weightSum; if (rng.Uniform<Float>() < p) { reservoir = func(); reservoirWeight = weight; }

}

A number of methods provide access to the sample and the probability that it was stored in the reservoir.

<<WeightedReservoirSampler Public Methods>>+=

int HasSample() const { return weightSum > 0; } const T &GetSample() const { return reservoir; } Float SampleProbability() const { return reservoirWeight / weightSum; } Float WeightSum() const { return weightSum; }

It is sometimes useful to reset a WeightedReservoirSampler and restart from scratch with a new stream of samples; the Reset() method handles this task.

<<WeightedReservoirSampler Public Methods>>+=

void Reset() { reservoirWeight = weightSum = 0; }

Remarkably, it is possible to merge two reservoirs into one in such a way that the stored sample is kept with the same probability as if a single reservoir had considered all of the samples seen by the two. Merging two reservoirs is a matter of randomly taking the sample stored by the second reservoir with probability defined by its sum of sample weights divided by the sum of both reservoirs’ sums of sample weights, which in turn is exactly what the Add() method does.

<<WeightedReservoirSampler Public Methods>>+=

void Merge(const WeightedReservoirSampler &wrs) { if (wrs.HasSample()) Add(wrs.reservoir, wrs.weightSum); }