Path Tracer Implementation

15.3 Path Tracer Implementation

With these utility classes in hand, we can turn to the WavefrontPathIntegrator class implementation. As mentioned earlier, its functionality matches that of the VolPathIntegrator, restructured to run with a wavefront architecture.

The WavefrontPathIntegrator class declaration is in the file wavefront/integrator.h and some of its implementation is in wavefront/integrator.cpp, though a number of its method implementations are distributed across separate source files in the wavefront/ directory. While this is a different organization than we have used elsewhere in pbrt (where all the non-inline method definitions for a class defined in a file named file.h are in file.cpp), distributing them in this way reduces the time necessary to compile pbrt, since many of the methods make use of C++ features that can lead to long compile times. Spreading them out across multiple files allows multiple CPU cores to compile their implementations in parallel.

<<WavefrontPathIntegrator Definition>>=

class WavefrontPathIntegrator { public: <<WavefrontPathIntegrator Public Methods>>

Float Render(); void GenerateCameraRays(int y0, int sampleIndex); template <typename Sampler> void GenerateCameraRays(int y0, int sampleIndex); void GenerateRaySamples(int wavefrontDepth, int sampleIndex); template <typename Sampler> void GenerateRaySamples(int wavefrontDepth, int sampleIndex); void TraceShadowRays(int wavefrontDepth); void SampleMediumInteraction(int wavefrontDepth); template <typename PhaseFunction> void SampleMediumScattering(int wavefrontDepth); void SampleSubsurface(int wavefrontDepth); void HandleEscapedRays(); void HandleEmissiveIntersection(); void EvaluateMaterialsAndBSDFs(int wavefrontDepth); template <typename ConcreteMaterial> void EvaluateMaterialAndBSDF(int wavefrontDepth); template <typename ConcreteMaterial, typename TextureEvaluator> void EvaluateMaterialAndBSDF(MaterialEvalQueue *evalQueue, int wavefrontDepth); void UpdateFilm(); WavefrontPathIntegrator(pstd::pmr::memory_resource *memoryResource, BasicScene &scene); template <typename F> void ParallelFor(const char *description, int nItems, F &&func) { if (Options->useGPU) GPUParallelFor(description, nItems, func); else pbrt::ParallelFor(0, nItems, func); } template <typename F> void Do(const char *description, F &&func) { if (Options->useGPU) GPUParallelFor(description, 1, [=] PBRT_GPU (int) mutable { func(); }); else func(); } RayQueue *CurrentRayQueue(int wavefrontDepth) { return rayQueues[wavefrontDepth & 1]; } RayQueue *NextRayQueue(int wavefrontDepth) { return rayQueues[(wavefrontDepth + 1) & 1]; }

<<WavefrontPathIntegrator Member Variables>>

bool initializeVisibleSurface; bool haveSubsurface; bool haveMedia; pstd::array<bool, Material::NumTags()> haveBasicEvalMaterial; pstd::array<bool, Material::NumTags()> haveUniversalEvalMaterial; struct Stats { Stats(int maxDepth, Allocator alloc); std::string Print() const; // Note: not atomics: tid 0 always updates them for everyone... uint64_t cameraRays = 0; pstd::vector<uint64_t> indirectRays, shadowRays; }; Stats *stats; pstd::pmr::memory_resource *memoryResource; Filter filter; Film film; Sampler sampler; Camera camera; pstd::vector<Light> *infiniteLights; LightSampler lightSampler; int maxDepth, samplesPerPixel; bool regularize; int scanlinesPerPass, maxQueueSize; SOA<PixelSampleState> pixelSampleState; RayQueue *rayQueues[2]; WavefrontAggregate *aggregate = nullptr; MediumSampleQueue *mediumSampleQueue = nullptr; MediumScatterQueue *mediumScatterQueue = nullptr; EscapedRayQueue *escapedRayQueue = nullptr; HitAreaLightQueue *hitAreaLightQueue = nullptr; MaterialEvalQueue *basicEvalMaterialQueue = nullptr; MaterialEvalQueue *universalEvalMaterialQueue = nullptr; ShadowRayQueue *shadowRayQueue = nullptr; GetBSSRDFAndProbeRayQueue *bssrdfEvalQueue = nullptr; SubsurfaceScatterQueue *subsurfaceScatterQueue = nullptr;

};

The constructor converts the provided BasicScene into the objects that represent the scene for rendering. We will skip over the majority of its implementation here, however, as most of it just calls all the appropriate object Create() methods to allocate and initialize the corresponding objects (see Section C.3). All allocations are performed with Allocators that use the provided memory resource. When rendering on the GPU, this leads to the use of managed memory.

<<WavefrontPathIntegrator Public Methods>>=

WavefrontPathIntegrator(pstd::pmr::memory_resource *memoryResource, BasicScene &scene);

<<WavefrontPathIntegrator Member Variables>>=

pstd::pmr::memory_resource *memoryResource;

These are some of the key scene objects that are initialized in the constructor. As with the CPU-based Integrators, the infinite lights are stored independently so that they can be efficiently iterated over for rays that escape the scene.

<<WavefrontPathIntegrator Member Variables>>+=

Filter filter; Film film; Sampler sampler; Camera camera; pstd::vector<Light> *infiniteLights; LightSampler lightSampler;

There are a few additional member variables to store what should now be familiar configuration options.

<<WavefrontPathIntegrator Member Variables>>+=

int maxDepth, samplesPerPixel; bool regularize;

In order to limit memory use, this integrator places a limit on the number of pixel samples that it works on at once. All told, each active pixel sample requires roughly 1,000 bytes of additional storage for its state variables and to ensure that all the queues have sufficient space for the work to be done for the ray. (The actual amount of storage varies based on the scene, as the constructor is careful not to allocate work queues for the volume scattering kernels if there is no participating media in the scene, for example.)

The following fragment from the WavefrontPathIntegrator constructor therefore sets a maximum number of active samples and then determines how many scanlines of pixels that corresponds to. That value in turn determines how many passes (of rendering that many scanlines) are necessary to cover the full image resolution. Finally, scanlinesPerPass is set with a new value that evens out the number of scanlines rendered in each pass. This can help with load balancing by avoiding having a small number of scanlines in the final pass.

<<Compute number of scanlines to render per pass>>=

Vector2i resolution = film.PixelBounds().Diagonal(); int maxSamples = 1024 * 1024; scanlinesPerPass = std::max(1, maxSamples / resolution.x); int nPasses = (resolution.y + scanlinesPerPass - 1) / scanlinesPerPass; scanlinesPerPass = (resolution.y + nPasses - 1) / nPasses; maxQueueSize = resolution.x * scanlinesPerPass;

All the queues that are allocated to buffer work between kernels are also allocated to be able to store maxQueueSize individual work items. Thus, all queues are able to store one work item for each of the active pixel samples. This is a sufficient number, as none of the kernels in the current implementation ever push more than one item on a queue for each item processed. However, it may waste a substantial amount of memory. For example, there is a work queue for rays that hit emissive surfaces in the scene. It is rare that all the rays will hit emissive objects, yet there must be storage for all in case of that eventuality. The alternative, dynamically increasing the size of the queues when necessary, would be difficult to implement efficiently in the context of the massive parallelism on GPUs.

<<WavefrontPathIntegrator Member Variables>>+=

int scanlinesPerPass, maxQueueSize;

The WavefrontPathIntegrator does not use only queues to provide values to kernels and to communicate results. While the queue model is elegant, it can be inefficient if some values are computed early and not used until much later. For example, consider the VisibleSurface structure that is provided to Film implementations like the GBufferFilm: it is initialized at the first intersection after the camera, but then its value is not used again until the full ray path has been traced and the Film::AddSample() method is called at the end. VisibleSurface is, further, a relatively large structure. In a purely queue-based model, a substantial amount of memory bandwidth would be consumed passing it along from kernel to kernel until the end.

Therefore, the PixelSampleState structure is used for storing all such values. Various member variables will be added to it in what follows.

<<PixelSampleState Definition>>=

struct PixelSampleState { <<PixelSampleState Public Members>>

Point2i pPixel; SampledSpectrum L; SampledWavelengths lambda; Float filterWeight; VisibleSurface visibleSurface; SampledSpectrum cameraRayWeight; RaySamples samples;

};

The WavefrontPathIntegrator maintains an SOA-arranged array of PixelSampleStates, allocated to have maxQueueSize entries. Each sample’s index into this array is carried through the rendering computation so that it is easy to determine which entry corresponds to a pixel sample being processed.

<<WavefrontPathIntegrator Member Variables>>+=

SOA<PixelSampleState> pixelSampleState;

15.3.1 Work Launch

Because the WavefrontPathIntegrator may be running either on the CPU or on the GPU, it provides methods for launching work that use the appropriate processor. Each selects the appropriate type of processor based on the renderer’s configuration.

The first, ParallelFor(), selects between the types of two parallel for loops.

<<WavefrontPathIntegrator Public Methods>>+=

template <typename F> void ParallelFor(const char *description, int nItems, F &&func) { if (Options->useGPU) GPUParallelFor(description, nItems, func); else pbrt::ParallelFor(0, nItems, func); }

The Do() method executes the provided function in a single thread. On the CPU, it is no different than a regular function call; on the GPU, however, it executes the provided function using GPUParallelFor() with a single-item loop. For reasons that should be clear by now, this not a good way to do a meaningful amount of work on the GPU, but this capability is necessary for resetting counters, clearing queues, and the like.

<<WavefrontPathIntegrator Public Methods>>+=

template <typename F> void Do(const char *description, F &&func) { if (Options->useGPU) GPUParallelFor(description, 1, [=] PBRT_GPU (int) mutable { func(); }); else func(); }

15.3.2 The Render() Method

Rendering is initiated by a call to Render(). (Recall Figure 15.2 in Section 15.1.3, which summarizes the kernels that this method launches.) Similar to earlier integrators, it progressively takes more samples in all pixels until the requested number of samples have been taken. This method tracks how long rendering takes and returns the number of elapsed seconds; we have not included here the straightforward few lines of code that handle that.

One subtlety is that the initialization of the pixelBounds variable at the start is important for rendering performance. It will be necessary to have this value later on in the implementation of Render()—though because the Film is stored in managed memory, calling the PixelBounds() method after GPU kernels have been launched could incur the overhead of copying data back to the CPU.

<<WavefrontPathIntegrator Method Definitions>>=

Float WavefrontPathIntegrator::Render() { Bounds2i pixelBounds = film.PixelBounds(); Vector2i resolution = pixelBounds.Diagonal(); <<Loop over sample indices and evaluate pixel samples>>

int firstSampleIndex = 0, lastSampleIndex = samplesPerPixel; for (int sampleIndex = firstSampleIndex; sampleIndex < lastSampleIndex; ++sampleIndex) { <<Render image for sample sampleIndex>>

for (int y0 = pixelBounds.pMin.y; y0 < pixelBounds.pMax.y; y0 += scanlinesPerPass) { <<Generate camera rays for current scanline range>>

RayQueue *cameraRayQueue = CurrentRayQueue(0); Do("Reset ray queue", PBRT_CPU_GPU_LAMBDA () { cameraRayQueue->Reset(); }); GenerateCameraRays(y0, sampleIndex);

<<Trace rays and estimate radiance up to maximum ray depth>>

for (int wavefrontDepth = 0; true; ++wavefrontDepth) { <<Reset queues before tracing rays>>

RayQueue *nextQueue = NextRayQueue(wavefrontDepth); Do("Reset queues before tracing rays", PBRT_CPU_GPU_LAMBDA () { nextQueue->Reset(); <<Reset queues before tracing next batch of rays>>

if (mediumSampleQueue) mediumSampleQueue->Reset(); if (mediumScatterQueue) mediumScatterQueue->Reset(); if (escapedRayQueue) escapedRayQueue->Reset(); hitAreaLightQueue->Reset(); basicEvalMaterialQueue->Reset(); universalEvalMaterialQueue->Reset(); if (bssrdfEvalQueue) bssrdfEvalQueue->Reset(); if (subsurfaceScatterQueue) subsurfaceScatterQueue->Reset();

});

<<Follow active ray paths and accumulate radiance estimates>>

GenerateRaySamples(wavefrontDepth, sampleIndex); <<Find closest intersections along active rays>>

aggregate->IntersectClosest(maxQueueSize, CurrentRayQueue(wavefrontDepth), escapedRayQueue, hitAreaLightQueue, basicEvalMaterialQueue, universalEvalMaterialQueue, mediumSampleQueue, NextRayQueue(wavefrontDepth));

SampleMediumInteraction(wavefrontDepth); HandleEscapedRays(); HandleEmissiveIntersection(); if (wavefrontDepth == maxDepth) break; EvaluateMaterialsAndBSDFs(wavefrontDepth); TraceShadowRays(wavefrontDepth); SampleSubsurface(wavefrontDepth);

}

UpdateFilm(); }

}

By default, the number of samples taken in each pixel is the number determined by the Sampler; the value returned by its SamplesPerPixel() method is cached in a member variable for the same reason as pixelBounds is cached above. It is possible to limit rendering to a single specified sample index using the –debugstart command line option; the code to set firstSampleIndex and lastSampleIndex in that case is not included here.

<<Loop over sample indices and evaluate pixel samples>>=

int firstSampleIndex = 0, lastSampleIndex = samplesPerPixel; for (int sampleIndex = firstSampleIndex; sampleIndex < lastSampleIndex; ++sampleIndex) { <<Render image for sample sampleIndex>>

for (int y0 = pixelBounds.pMin.y; y0 < pixelBounds.pMax.y; y0 += scanlinesPerPass) { <<Generate camera rays for current scanline range>>

RayQueue *cameraRayQueue = CurrentRayQueue(0); Do("Reset ray queue", PBRT_CPU_GPU_LAMBDA () { cameraRayQueue->Reset(); }); GenerateCameraRays(y0, sampleIndex);

<<Trace rays and estimate radiance up to maximum ray depth>>

for (int wavefrontDepth = 0; true; ++wavefrontDepth) { <<Reset queues before tracing rays>>

RayQueue *nextQueue = NextRayQueue(wavefrontDepth); Do("Reset queues before tracing rays", PBRT_CPU_GPU_LAMBDA () { nextQueue->Reset(); <<Reset queues before tracing next batch of rays>>

});

<<Follow active ray paths and accumulate radiance estimates>>

GenerateRaySamples(wavefrontDepth, sampleIndex); <<Find closest intersections along active rays>>

}

UpdateFilm(); }

}

Given a sample index, the next step is to loop over the one or more chunks of scanlines and to evaluate a sample in each of their pixels. This code is also similar to the corresponding code in the VolPathIntegrator: evaluating each sample starts with generating a camera ray and then following it through multiple intersections until it is terminated or the maximum depth is reached. The key difference is that, here, these tasks are being performed for as many as a million or so pixel samples at a time.

<<Render image for sample sampleIndex>>=

for (int y0 = pixelBounds.pMin.y; y0 < pixelBounds.pMax.y; y0 += scanlinesPerPass) { <<Generate camera rays for current scanline range>>

RayQueue *cameraRayQueue = CurrentRayQueue(0); Do("Reset ray queue", PBRT_CPU_GPU_LAMBDA () { cameraRayQueue->Reset(); }); GenerateCameraRays(y0, sampleIndex);

<<Trace rays and estimate radiance up to maximum ray depth>>

for (int wavefrontDepth = 0; true; ++wavefrontDepth) { <<Reset queues before tracing rays>>

RayQueue *nextQueue = NextRayQueue(wavefrontDepth); Do("Reset queues before tracing rays", PBRT_CPU_GPU_LAMBDA () { nextQueue->Reset(); <<Reset queues before tracing next batch of rays>>

});

<<Follow active ray paths and accumulate radiance estimates>>

GenerateRaySamples(wavefrontDepth, sampleIndex); <<Find closest intersections along active rays>>

}

UpdateFilm(); }

Before the rays are generated, it is necessary to reset the work queue that will store them. We will need to maintain more than one ray queue: one for the set of rays currently being traced and another for the indirect rays that have been spawned to be traced at the next depth. The CurrentRayQueue() method, defined shortly, returns the queue for the specified depth.

When the GPU is being used for rendering, it is critically important that the Reset() method is called from the GPU and not the CPU; Do() is thus used here. Not only could resetting it from the CPU be inefficient, as doing so would involve the CPU writing to managed memory, but—given the asynchronous execution of the GPU—it would almost certainly be incorrect, potentially resetting a queue that was still in use by the code that was executing on the GPU.

<<Generate camera rays for current scanline range>>=

RayQueue *cameraRayQueue = CurrentRayQueue(0); Do("Reset ray queue", PBRT_CPU_GPU_LAMBDA () { cameraRayQueue->Reset(); }); GenerateCameraRays(y0, sampleIndex);

The RayQueue class adds a few convenience methods to WorkQueue that we will discuss in the following pages. (RayWorkItem will also be defined in a few pages, closer to where it is first used.)

<<RayQueue Definition>>=

class RayQueue : public WorkQueue<RayWorkItem> { public: <<RayQueue Public Methods>>

int PushCameraRay(const Ray &ray, const SampledWavelengths &lambda, int pixelIndex); int PushIndirectRay(const Ray &ray, int depth, const LightSampleContext &prevIntrCtx, const SampledSpectrum &beta, const SampledSpectrum &r_u, const SampledSpectrum &r_l, const SampledWavelengths &lambda, Float etaScale, bool specularBounce, bool anyNonSpecularBounces, int pixelIndex);

};

The WavefrontPathIntegrator maintains a pair of ray queues, rather than allocating one for each ray depth up to the maximum. It manages them using double buffering: one stores the current active set of rays and should only be read from, while the other stores the rays enqueued to trace at the next depth and should only be written to. At each successive ray depth, the roles of the two queues are switched.

<<WavefrontPathIntegrator Member Variables>>+=

RayQueue *rayQueues[2];

Two convenience methods return pointers to the RayQueues given a depth. The double buffering logic is effectively encapsulated in their implementations.

<<WavefrontPathIntegrator Public Methods>>+=

RayQueue *CurrentRayQueue(int wavefrontDepth) { return rayQueues[wavefrontDepth & 1]; } RayQueue *NextRayQueue(int wavefrontDepth) { return rayQueues[(wavefrontDepth + 1) & 1]; }

15.3.3 Generating Camera Rays

The two methods related to generating camera rays are implemented in the file wavefront/camera.cpp, not in wavefront/integrator.cpp, because we have used C++ templates to generate multiple specialized instances of the methods, each one specialized based on some of the object types involved. As mentioned earlier, techniques like this can cause lengthy compile times, so it is worthwhile to isolate the camera method implementations in their own source file.

There are two GenerateCameraRays() methods. The first, which is called by WavefrontPathIntegrator::Render(), determines the concrete type of the Sampler being used and then calls the second GenerateCameraRays() method, which is templated on the type of the Sampler. This idea is something that we will see again in other methods: CPU code making a runtime determination of the types of objects involved in a computation, which allows the execution of code that is specialized for those types. This is especially beneficial to performance when running on the GPU.

The first method defines a lambda function, generateRays, and then invokes the TaggedPointer’s dynamic dispatch mechanism, which will end up calling the lambda function using the concrete type of the Sampler. (In this case, we use the DispatchCPU() variant, which must be used for code that can only execute on the CPU.)

<<WavefrontPathIntegrator Camera Ray Methods>>=

void WavefrontPathIntegrator::GenerateCameraRays(int y0, int sampleIndex) { <<Define generateRays lambda function>>

auto generateRays = [=](auto sampler) { using ConcreteSampler = std::remove_reference_t<decltype(*sampler)>; if constexpr (!std::is_same_v<ConcreteSampler, MLTSampler> && !std::is_same_v<ConcreteSampler, DebugMLTSampler>) GenerateCameraRays<ConcreteSampler>(y0, sampleIndex); };

sampler.DispatchCPU(generateRays); }

Using auto in the parameter declaration for the following lambda function causes it to be parameterized by the type of sampler, like a template function. (C++20 provides a less obscure syntax for templated lambda functions, though this version of pbrt limits itself to C++17.) Thus, the type of sampler will be a concrete instance of one of the sampler types provided in the Sampler declaration in Section 8.3.

There is a nit in that sampler is passed as a reference to a pointer to the specific Sampler type; a bit of work in the using declaration gives us the actual Sampler type. A second nit is that we must filter out the MLTSampler, which is only used by the MLTIntegrator, and DebugMLTSampler, a variant of it that is used for debugging. Those classes use vector methods like push_back() that are not supported in GPU code, and therefore we must make clear to the compiler what we know in any case: those samplers will not be used here.

With the concrete sampler type in hand, the second GenerateCameraRays() method can be called, with the sampler type provided for the template specialization.

<<Define generateRays lambda function>>=

We can now move on to the second GenerateCameraRays() method, which calls WavefrontPathIntegrator::ParallelFor() to generate all of the rays in parallel.

<<WavefrontPathIntegrator Camera Ray Methods>>+=

template <typename ConcreteSampler> void WavefrontPathIntegrator::GenerateCameraRays(int y0, int sampleIndex) { RayQueue *rayQueue = CurrentRayQueue(0); ParallelFor("Generate camera rays", maxQueueSize, PBRT_CPU_GPU_LAMBDA (int pixelIndex) { <<Enqueue camera ray and set pixel state for sample>>

<<Compute pixel coordinates for pixelIndex>>

Bounds2i pixelBounds = film.PixelBounds(); int xResolution = pixelBounds.pMax.x - pixelBounds.pMin.x; Point2i pPixel(pixelBounds.pMin.x + pixelIndex % xResolution, y0 + pixelIndex / xResolution); pixelSampleState.pPixel[pixelIndex] = pPixel;

<<Test pixel coordinates against pixel bounds>>

if (!InsideExclusive(pPixel, pixelBounds)) return;

<<Initialize Sampler for current pixel and sample>>

ConcreteSampler pixelSampler = *sampler.Cast<ConcreteSampler>(); pixelSampler.StartPixelSample(pPixel, sampleIndex, 0);

<<Sample wavelengths for ray path>>

Float lu = pixelSampler.Get1D(); SampledWavelengths lambda = film.SampleWavelengths(lu);

<<Generate CameraSample and corresponding ray>>

CameraSample cameraSample = GetCameraSample(pixelSampler, pPixel, filter); pstd::optional<CameraRay> cameraRay = camera.GenerateRay(cameraSample, lambda);

<<Initialize remainder of PixelSampleState for ray>>

pixelSampleState.L[pixelIndex] = SampledSpectrum(0.f); pixelSampleState.lambda[pixelIndex] = lambda; pixelSampleState.filterWeight[pixelIndex] = cameraSample.filterWeight; if (initializeVisibleSurface) pixelSampleState.visibleSurface[pixelIndex] = VisibleSurface();

<<Enqueue camera ray for intersection tests>>

if (cameraRay) { rayQueue->PushCameraRay(cameraRay->ray, lambda, pixelIndex); pixelSampleState.cameraRayWeight[pixelIndex] = cameraRay->weight; } else pixelSampleState.cameraRayWeight[pixelIndex] = SampledSpectrum(0);

}); }

The sequence of operations performed for each camera ray again generally matches what we have seen before: after computing the pixel coordinates for the provided loop index, samples are generated, a set of wavelengths are sampled, and then the Camera generates the ray.

<<Enqueue camera ray and set pixel state for sample>>=

<<Compute pixel coordinates for pixelIndex>>

<<Test pixel coordinates against pixel bounds>>

if (!InsideExclusive(pPixel, pixelBounds)) return;

<<Initialize Sampler for current pixel and sample>>

ConcreteSampler pixelSampler = *sampler.Cast<ConcreteSampler>(); pixelSampler.StartPixelSample(pPixel, sampleIndex, 0);

<<Sample wavelengths for ray path>>

Float lu = pixelSampler.Get1D(); SampledWavelengths lambda = film.SampleWavelengths(lu);

<<Generate CameraSample and corresponding ray>>

CameraSample cameraSample = GetCameraSample(pixelSampler, pPixel, filter); pstd::optional<CameraRay> cameraRay = camera.GenerateRay(cameraSample, lambda);

<<Initialize remainder of PixelSampleState for ray>>

<<Enqueue camera ray for intersection tests>>

Given the pixel bounds of the film, the pixelIndex value is mapped to pixel coordinates, starting from the lower bound given by the pixel bounds and y0 value, respectively. Threads are then assigned to pixels in scanline order. Because the PBRT_GPU_LAMBDA macro includes *this in the lambda capture, various WavefrontPathIntegrator member variables such as film, which is used here, can be directly accessed in the lambda function.

The pixel coordinates are then stored in PixelSampleState. Note that if we are just setting a single member variable of an SOA structure rather than assigning an entire structure value, then the member variable must be indexed and not the pixelSampleState structure itself; our automatically generated SOA classes are not able to provide the same syntax as would be used for an array of structures layout in that case.

<<Compute pixel coordinates for pixelIndex>>=

The pixel coordinates corresponding to a sample are our first addition to the PixelSampleState structure.

<<PixelSampleState Public Members>>=

Point2i pPixel;

If the image has been split into multiple spans of scanlines, then the number of parallel loop iterations in the final pass may be a few more than there are pixels left to be sampled. Rather than worry about this when specifying the number of threads to launch in the call to WavefrontPathIntegrator::ParallelFor(), we just check this condition in the kernel. Indices for out of bounds pixels return after initializing PixelSampleState::pPixel and do not enqueue any rays.

<<Test pixel coordinates against pixel bounds>>=

if (!InsideExclusive(pPixel, pixelBounds)) return;

Next, samples are needed to generate the camera ray. Here we can see the value of specializing this method based on the ConcreteSampler type. The following fragment is perhaps best understood by reading it with the ConcreteSampler in the following replaced with, say, HaltonSampler in your head. pixelSampler then is a stack-allocated HaltonSampler, and the TaggedPointer::Cast() call gives us a pointer of that type; dereferencing it lets each thread initialize its own sampler from the exemplar of the one that the WavefrontPathIntegrator stores. All threads access the same memory locations, so reading the Sampler from memory does not consume much bandwidth.

The benefit from this approach comes from having the sampler allocated on the stack. On the GPU, its member variables can be stored directly in registers, giving high performance when executing its methods. As an additional bonus, there is no overhead for dynamic dispatch in the StartPixelSample() call: the sampler’s type is known, so the appropriate method can be called directly. It will usually be expanded inline at the call site.

<<Initialize Sampler for current pixel and sample>>=

ConcreteSampler pixelSampler = *sampler.Cast<ConcreteSampler>(); pixelSampler.StartPixelSample(pPixel, sampleIndex, 0);

Wavelengths are sampled in precisely the same way as they are in most of the other integrators.

<<Sample wavelengths for ray path>>=

Float lu = pixelSampler.Get1D(); SampledWavelengths lambda = film.SampleWavelengths(lu);

The CameraSample can be initialized in the usual way and then it is on to call Camera::GenerateRay(). In this case, this method call is also resolved using pbrt’s usual dynamic dispatch system, just as it is in all the other integrators. An alternative implementation approach would be to further specialize the GenerateCameraRays() method based on the type of the Camera being used; after all, it is the same for all pixel samples and so it is somewhat wasteful for all threads to perform the same computations for dynamic dispatch. We have found that in practice that alternative gives a negligible performance benefit, and so our implementation here remains based on dynamic dispatch.

One other thing to note is that the wavefront integrator path generates regular Rays here and not RayDifferentials. Approximate differentials for filtering will be computed later, trading off superior antialiasing quality in exchange for higher performance from reduced memory bandwidth due to having less information associated with each ray. An exercise at the end of the chapter revisits this choice.

<<Generate CameraSample and corresponding ray>>=

CameraSample cameraSample = GetCameraSample(pixelSampler, pPixel, filter); pstd::optional<CameraRay> cameraRay = camera.GenerateRay(cameraSample, lambda);

A few additional values that are stored in PixelSampleState can now be set. We avoid initializing the heavyweight VisibleSurface member if it is not going to be used by the Film; the initializeVisibleSurface member variable is set in the WavefrontPathIntegrator constructor to record whether it is needed. Saving this memory bandwidth when possible is worth this easy check.

<<Initialize remainder of PixelSampleState for ray>>=

Of these member variables, we will see L most often in what follows. It stores the accumulated radiance estimate for the ray path. After the path terminates, its value will be provided to the Film.

<<PixelSampleState Public Members>>+=

SampledSpectrum L; SampledWavelengths lambda; Float filterWeight; VisibleSurface visibleSurface;

If the Camera successfully generated a ray, it is pushed into the RayQueue along with its wavelengths and associated pixel index, which allows subsequent kernels to be able to index into WavefrontPathIntegrator::pixelSampleState to retrieve values there that are associated with this ray.

<<Enqueue camera ray for intersection tests>>=

<<PixelSampleState Public Members>>+=

SampledSpectrum cameraRayWeight;

Some of the work queues provide specialized methods for pushing work on to them; RayQueue is one of them. It provides a PushCameraRay() method for camera rays and PushIndirectRay() for scattered rays that sample indirect lighting. Doing so not only makes the code where work is pushed on to the queue slightly cleaner, but it also makes it possible to set some of the forthcoming RayWorkItem member variables to default values for camera rays without needing to specify default values here. (For example, RayWorkItem carries path sampling PDFs that are initialized to 1 for camera rays.) We will pass over the implementation of this method here, as it is all setting member variables, either from passed-in values or from defaults.

<<RayQueue Public Methods>>=

int PushCameraRay(const Ray &ray, const SampledWavelengths &lambda, int pixelIndex);

We will start the definition of the RayWorkItem structure here.

<<RayWorkItem Definition>>=

struct RayWorkItem { <<RayWorkItem Public Members>>

Ray ray; int depth; SampledWavelengths lambda; int pixelIndex; SampledSpectrum beta, r_u, r_l; LightSampleContext prevIntrCtx; Float etaScale; int specularBounce; int anyNonSpecularBounces;

};

As would be expected from the parameters passed to PushCameraRay(), RayWorkItem stores a ray, its wavelengths, and an index into the WavefrontPathIntegrator’s pixelSampleState array from which additional data associated with the ray can be found.

<<RayWorkItem Public Members>>=

Ray ray; int depth; SampledWavelengths lambda; int pixelIndex;

The reader may have noted that lambda is both pushed to the ray queue and stored in the PixelSampleState, which is admittedly redundant, though there are good reasons for both. Its value must be stored in the PixelSampleState structure, as it is required for updating the film, which is not scheduled using work queues but via a loop over the PixelSampleState values; see further discussion of this design in Section 15.3.11.

However, if other kernels along the way accessed lambda via PixelSampleState, performance would be poor, since the initial correlation between the value of pixelIndex and threads in a thread group quickly becomes shuffled up over the course of rendering. Thus, reads from PixelSampleState would be highly incoherent. Because it is used in just about every kernel, lambda is also passed along throughout the queues of the wavefront integrator. Since it is in work queues, the loads and stores to read and save its value are coherent across the thread group, giving good performance.

15.3.4 Loop over Ray Depths

With the camera having seeded the ray queue with an initial set of rays, we can now turn to the loop that executes once for each successive set of ray intersection tests and shading calculations up until the maximum ray path length. One subtlety is that this loop is over “wavefront depth,” which is different than the ray depth that has been tracked in other integrators. Here, the wavefront depth reflects the number of times that this loop has executed, tracing a batch of rays and processing their intersections. Each ray tracks its own depth, which is usually the same as the wavefront depth, though the depth of rays that hit invisible surfaces that delineate the boundaries between volumetric media is not incremented at those intersections (as in other integrators).

This loop always runs on the CPU; when the GPU is used for rendering, it is responsible for launching the appropriate kernels to perform the rendering computation. An implication of this design and the decoupling of the CPU and GPU is that the CPU has no visibility into the state of the ray-tracing calculations on the GPU. For example, if all rays terminate after the first intersection, the CPU will continue submitting kernel launches to execute the ray-tracing pipeline even though no work is passing through it. All the kernels would exit immediately, though there would be a cost from launching all of them and their determining that there is no work to be done. For many scenes this is not a problem, but it can harm this integrator’s performance with high maximum ray depths. An exercise at the end of the chapter outlines some design alternatives.

<<Trace rays and estimate radiance up to maximum ray depth>>=

for (int wavefrontDepth = 0; true; ++wavefrontDepth) { <<Reset queues before tracing rays>>

RayQueue *nextQueue = NextRayQueue(wavefrontDepth); Do("Reset queues before tracing rays", PBRT_CPU_GPU_LAMBDA () { nextQueue->Reset(); <<Reset queues before tracing next batch of rays>>

});

<<Follow active ray paths and accumulate radiance estimates>>

GenerateRaySamples(wavefrontDepth, sampleIndex); <<Find closest intersections along active rays>>

}

All the work queues except for the RayQueue holding the current set of rays must be cleared at the start of each iteration. The following fragment clears the RayQueue for the next set of indirect rays and adds a fragment for the rest of the queues. We will add additional Reset() calls to this fragment for the various other queues along with the code that defines and uses them.

<<Reset queues before tracing rays>>=

RayQueue *nextQueue = NextRayQueue(wavefrontDepth); Do("Reset queues before tracing rays", PBRT_CPU_GPU_LAMBDA () { nextQueue->Reset(); <<Reset queues before tracing next batch of rays>>

});

Once the queues are cleared, rendering proceeds mostly following the same steps as the VolPathIntegrator. First, the Sampler computes sample values for all of the rays and stores them in memory.

<<Follow active ray paths and accumulate radiance estimates>>=

GenerateRaySamples(wavefrontDepth, sampleIndex);

The closest intersection with a surface, if any, is then found for each ray.

<<Follow active ray paths and accumulate radiance estimates>>+=

<<Find closest intersections along active rays>>

Before surface scattering or emission is considered, the medium (if any) is sampled. If the medium scatters or absorbs the ray, then any surface intersection will be ignored.

<<Follow active ray paths and accumulate radiance estimates>>+=

SampleMediumInteraction(wavefrontDepth);

Only after medium sampling are rays that left the scene taken care of. In this way, rays passing through participating media that do not interact with it and then leave the scene can be processed at the same time as rays that left the scene but did not pass through media.

<<Follow active ray paths and accumulate radiance estimates>>+=

HandleEscapedRays();

The contribution of emissive surfaces to a sample’s radiance value is also only added after medium sampling, since it should not be included if the medium scattered or absorbed the ray.

<<Follow active ray paths and accumulate radiance estimates>>+=

HandleEmissiveIntersection();

The loop over wavefront depth can only be terminated after emissive surfaces have been accounted for, since the MIS-based direct lighting calculation accounts for their contribution.

<<Follow active ray paths and accumulate radiance estimates>>+=

if (wavefrontDepth == maxDepth) break;

If the loop does not terminate, only now are surface intersections handled, with the Materials evaluating their textures and returning BSDFs, lights sampled, and indirect rays enqueued.

<<Follow active ray paths and accumulate radiance estimates>>+=

EvaluateMaterialsAndBSDFs(wavefrontDepth);

Next, the shadow rays enqueued by the material evaluation kernel are traced; the radiance contributions of the unoccluded ones are accumulated in PixelSampleState.

<<Follow active ray paths and accumulate radiance estimates>>+=

TraceShadowRays(wavefrontDepth);

The following sections go into these steps in more detail.

15.3.5 Sample Generation

The first step in each loop iteration is to generate values for all the sample dimensions that may be required for sampling lighting if a scattering event is found along the ray, either in a participating medium or on a surface. Generating all of these samples ahead of time in a separate kernel allows for specializing that kernel based on the sampler type, giving the same performance benefits as were found in GenerateCameraRays().

The GenerateRaySamples() method is defined in the file wavefront/samples.cpp. Similar to camera rays, there is an initial GenerateRaySamples() method called in the main rendering loop that determines the actual Sampler type and invokes the appropriate specialization. Because this dispatch method is nearly the same as the analog in GenerateCameraRays(), we omit it here and turn directly to the specialized method.

<<WavefrontPathIntegrator Sampler Methods>>=

template <typename ConcreteSampler> void WavefrontPathIntegrator::GenerateRaySamples(int wavefrontDepth, int sampleIndex) { <<Generate description string desc for ray sample generation>>

std::string desc = std::string("Generate ray samples - ") + ConcreteSampler::Name();

RayQueue *rayQueue = CurrentRayQueue(wavefrontDepth); ForAllQueued(desc.c_str(), rayQueue, maxQueueSize, PBRT_CPU_GPU_LAMBDA (const RayWorkItem w) { <<Generate samples for ray segment at current sample index>>

<<Find first sample dimension>>

int dimension = 6 + 7 * w.depth; if (haveMedia) dimension += 2 * w.depth;

<<Initialize Sampler for pixel, sample index, and dimension>>

ConcreteSampler pixelSampler = *sampler.Cast<ConcreteSampler>(); Point2i pPixel = pixelSampleState.pPixel[w.pixelIndex]; pixelSampler.StartPixelSample(pPixel, sampleIndex, dimension);

<<Initialize RaySamples structure with sample values>>

RaySamples rs; rs.direct.uc = pixelSampler.Get1D(); rs.direct.u = pixelSampler.Get2D(); <<Initialize remaining samples in rs>>

rs.indirect.uc = pixelSampler.Get1D(); rs.indirect.u = pixelSampler.Get2D(); rs.indirect.rr = pixelSampler.Get1D(); rs.haveMedia = haveMedia; if (haveMedia) { rs.media.uDist = pixelSampler.Get1D(); rs.media.uMode = pixelSampler.Get1D(); }

<<Store RaySamples in pixel sample state>>

pixelSampleState.samples[w.pixelIndex] = rs;

}); }

Unlike the CPU-only integrators, where sample dimensions are allocated implicitly based on the runtime sequence of Get1D() and Get2D() method calls, here dimensions are explicitly allocated, ensuring that unique dimensions are assigned to different uses of samples. Doing so imposes a coupling between this kernel and the use of samples in following ones and also means that samples may be generated that are not actually used (e.g., if a perfect specular surface is intersected). These costs are worth paying in return for the performance benefits of generating samples in a specialized kernel.

<<Generate samples for ray segment at current sample index>>=

<<Find first sample dimension>>

int dimension = 6 + 7 * w.depth; if (haveMedia) dimension += 2 * w.depth;

<<Initialize Sampler for pixel, sample index, and dimension>>

ConcreteSampler pixelSampler = *sampler.Cast<ConcreteSampler>(); Point2i pPixel = pixelSampleState.pPixel[w.pixelIndex]; pixelSampler.StartPixelSample(pPixel, sampleIndex, dimension);

<<Initialize RaySamples structure with sample values>>

RaySamples rs; rs.direct.uc = pixelSampler.Get1D(); rs.direct.u = pixelSampler.Get2D(); <<Initialize remaining samples in rs>>

<<Store RaySamples in pixel sample state>>

pixelSampleState.samples[w.pixelIndex] = rs;

The first task is to find the first dimension to allocate for this ray’s samples. The first 5 sample dimensions are used to generate the camera ray, and then 1 is used to sample the wavelengths. At least 7 samples are needed for each ray depth: 3 for the call to BSDF::Sample_f(), 1 to sample a light source and then 2 to sample a position on it, and then 1 more for Russian roulette for the indirect ray. Two additional dimensions are consumed for sampling participating media. (The haveMedia WavefrontPathIntegrator member variable is set in its constructor based on the scene.)

<<Find first sample dimension>>=

int dimension = 6 + 7 * w.depth; if (haveMedia) dimension += 2 * w.depth;

This kernel uses the same trick to get a stack-allocated Sampler as was done for camera rays. In this case, figuring out which pixel the ray is associated with requires a read from the PixelSampleState.

<<Initialize Sampler for pixel, sample index, and dimension>>=

ConcreteSampler pixelSampler = *sampler.Cast<ConcreteSampler>(); Point2i pPixel = pixelSampleState.pPixel[w.pixelIndex]; pixelSampler.StartPixelSample(pPixel, sampleIndex, dimension);

The RaySamples structure bundles up the samples needed for a ray. A series of Get1D() and Get2D() calls initializes its member variables. We omit the fragment that initializes the indirect as it is more of the same. We have carefully ordered the sample generation method calls below to match their use in the VolPathIntegrator so that the two integrators use the same sample values for their sampling tasks at each pixel sample.

<<Initialize RaySamples structure with sample values>>=

RaySamples rs; rs.direct.uc = pixelSampler.Get1D(); rs.direct.u = pixelSampler.Get2D(); <<Initialize remaining samples in rs>>

<<RaySamples Definition>>=

struct RaySamples { <<RaySamples Public Members>>

struct { Point2f u; Float uc; } direct; struct { Float uc, rr; Point2f u; } indirect; bool haveSubsurface; struct { Float uc; Point2f u; } subsurface; bool haveMedia; struct { Float uDist, uMode; } media;

};

The three sample dimensions for sampling the light source are available in the direct substructure.

<<RaySamples Public Members>>=

struct { Point2f u; Float uc; } direct;

Similarly, the dimensions for BSDF sampling and Russian roulette are available in indirect.

<<RaySamples Public Members>>+=

struct { Float uc, rr; Point2f u; } indirect;

haveMedia indicates whether medium samples have been stored, which makes it possible to save bandwidth when they are unset.

<<RaySamples Public Members>>+=

bool haveMedia; struct { Float uDist, uMode; } media;

Sample values are squirreled away in PixelSampleState rather than being passed along via work queues. This does mean that both storing sample values here and reading them in subsequent kernels is not done with coherent memory accesses, since a thread group’s pixelIndex values will not necessarily be contiguous after the initial camera rays. However, because they are not used in many kernels, passing them along through work queues would entail multiple instances of reading them from one queue just to write them to another, which would be a waste of bandwidth. We have found that the current approach gives marginally better performance in practice.

<<Store RaySamples in pixel sample state>>=

pixelSampleState.samples[w.pixelIndex] = rs;

<<PixelSampleState Public Members>>+=

RaySamples samples;

One shortcoming of the approach implemented in this section is that samples are still generated for rays that do not intersect anything. A more optimized implementation might try to defer sample generation until the specific samples required were known, though the benefits are likely to be marginal: on the GPU, if the samples needed vary over the rays in a thread group, then—given the GPU’s thread group execution model—there may be no savings from skipping the work for some threads if it is still needed by others. Further, sample generation is normally just a few percent of overall rendering time, and so anything more sophisticated is not worth bothering with, at least for pbrt’s requirements.

15.3.6 Intersection Testing

Ray intersections are handled differently depending on whether the wavefront integrator is running on the CPU or the GPU. On the CPU, the ray queues are consumed using ParallelFor() calls and pbrt’s regular acceleration structures from Chapter 7 are used to find intersections. On the GPU, platform-specific functionality is used to do so. In order to abstract the differences between these approaches (and to make it easier to add support for additional GPU architectures), ray intersection work done by the WavefrontPathIntegrator is handled by an implementation of the WavefrontAggregate class, which defines an interface that reflects the integrator’s needs.

<<WavefrontAggregate Definition>>=

class WavefrontAggregate { public: <<WavefrontAggregate Interface>>

virtual ~WavefrontAggregate() = default; virtual Bounds3f Bounds() const = 0; virtual void IntersectClosest(int maxRays, const RayQueue *rayQ, EscapedRayQueue *escapedRayQ, HitAreaLightQueue *hitAreaLightQ, MaterialEvalQueue *basicMtlQ, MaterialEvalQueue *universalMtlQ, MediumSampleQueue *mediumSampleQ, RayQueue *nextRayQ) const = 0; virtual void IntersectShadow(int maxRays, ShadowRayQueue *shadowRayQueue, SOA<PixelSampleState> *pixelSampleState) const = 0; virtual void IntersectShadowTr(int maxRays, ShadowRayQueue *shadowRayQueue, SOA<PixelSampleState> *pixelSampleState) const = 0; virtual void IntersectOneRandom(int maxRays, SubsurfaceScatterQueue *subsurfaceScatterQueue) const = 0;

};

The WavefrontPathIntegrator stores a WavefrontAggregate in its aggregate member variable. The CPU implementation, CPUAggregate, is found in the source files [EntityList: wavefront/cmd: -: aggregate.h] and wavefront/aggregate.cpp. The implementation for NVIDIA GPUs, OptiXAggregate, is found in gpu/aggregate.h and gpu/aggregate.cpp.

<<WavefrontPathIntegrator Member Variables>>+=

WavefrontAggregate *aggregate = nullptr;

All WavefrontAggregates must provide a method that returns the bounds of the entire scene.

<<WavefrontAggregate Interface>>=

virtual Bounds3f Bounds() const = 0;

IntersectClosest() traces a set of rays and finds their closest surface intersections. Beyond the queue that provides the rays to be traced, a number of additional queues must be provided to it. Further work for a ray may be added to multiple queues depending on the specifics of its surface intersection.

<<WavefrontAggregate Interface>>+=

virtual void IntersectClosest(int maxRays, const RayQueue *rayQ, EscapedRayQueue *escapedRayQ, HitAreaLightQueue *hitAreaLightQ, MaterialEvalQueue *basicMtlQ, MaterialEvalQueue *universalMtlQ, MediumSampleQueue *mediumSampleQ, RayQueue *nextRayQ) const = 0;

The WavefrontPathIntegrator’s call to IntersectClosest() mostly passes the corresponding queues from its member variables, though calls to GetCurrentQueue() and NextRayQueue() are necessary to get the appropriate instances of those queues.

<<Find closest intersections along active rays>>=

In order to discuss the responsibilities of IntersectClosest() implementations, we will focus on its implementation in CPUAggregate. Rather than once again repeating the unwieldy list of arguments here, we will proceed directly to the method implementation, which starts with a parallel for loop over the items in the queue.

<<CPUAggregate::IntersectClosest() method implementation>>=

ParallelFor(0, rayQueue->Size(), [=] (int index) { const RayWorkItem r = (*rayQueue)[index]; <<Intersect r’s ray with the scene and enqueue resulting work>>

pstd::optional<ShapeIntersection> si = aggregate.Intersect(r.ray); if (!si) EnqueueWorkAfterMiss(r, mediumSampleQueue, escapedRayQueue); else EnqueueWorkAfterIntersection(r, r.ray.medium, si->tHit, si->intr, mediumSampleQueue, nextRayQueue, hitAreaLightQueue, basicEvalMaterialQueue, universalEvalMaterialQueue);

});

A regular aggregate stored in CPUAggregate handles the ray intersection test, with different cases afterward for rays that have an intersection with a surface and rays that do not.

<<Intersect r’s ray with the scene and enqueue resulting work>>=

The details of enqueuing further work for rays that have no intersections are handled by the EnqueueWorkAfterMiss() function, which is defined in the file wavefront/intersect.h. That header file provides a number of functions that are used by both the CPU- and GPU-based ray intersection code. Gathering them there allows a single implementation to be used by both, which in turn reduces the complexity of those WavefrontAggregate implementations.

Unlike many of the other functions in wavefront/intersect.h, EnqueueWorkAfterMiss() is simple enough that it barely merits its own function—if the ray is passing through participating media, it is enqueued for medium sampling, and otherwise it is enqueued for evaluating infinite light sources’ contribution to its radiance, if there are any in the scene.

<<Wavefront Ray Intersection Enqueuing Functions>>=

void EnqueueWorkAfterMiss(RayWorkItem r, MediumSampleQueue *mediumSampleQueue, EscapedRayQueue *escapedRayQueue) { if (r.ray.medium) mediumSampleQueue->Push(r, Infinity); else if (escapedRayQueue) escapedRayQueue->Push(r); }

For rays that do intersect a surface, there is more to be done. The EnqueueWorkAfterIntersection() function, which is not included in the text, handles all the following details.

If the ray is passing through participating media, it is enqueued for medium sampling. Only if the ray is not absorbed or scattered in the medium sampling kernel is work then queued for the surface intersection to be processed.

If a ray with an associated intersection is not scattered or absorbed by participating media and hits an emissive surface, it is added to a queue so that the surface’s scattered radiance will be added to the ray’s radiance estimate. Rays hitting surfaces are also sorted by the surfaces’ materials and the complexity of their textures into basicEvalMaterialQueue or universalEvalMaterialQueue, both of which are MultiWorkQueues. (Section 15.3.9 describes how the material queues are organized.) Finally, rays that hit surfaces with no materials that represent medium transitions are pushed on to the nextRayQueue to be continued in the next iteration, on the other side of the surface intersection with their medium member variable updated accordingly.

GPU Ray Intersections

Following our custom of not including platform-specific code in the book text, we will not discuss the details of pbrt’s use of CUDA and the OptiX API for its GPU ray-tracing implementation, but we will summarize the abstractions currently used for GPU ray tracing. See the files gpu/optix.h and gpu/optix.cu for details, however.

Current GPU ray intersection APIs follow a different model than the CPU-focused accelerators in Chapter 7. Those accelerators provide a fully procedural model, where the user calls functions that take a single ray and execute synchronously before returning their results. GPU ray tracing is based on a programmable pipeline, where some functionality is provided by the GPU vendor (either in hardware or in software), and some is provided by the user in the form of code that is executed at particular points in the pipeline.

The user-supplied code is in the form of a series of shaders, each of which is a function that is invoked in specific cases. In practice, many instances of these shaders run concurrently, following the GPU’s thread group execution model. Ray tracing starts with a ray generation shader that is responsible for generating a ray and submitting it to the GPU ray-tracing function. In pbrt’s implementation, the ray generation shader retrieves the ray from the RayQueue.

GPUs currently only have native support for intersecting rays with triangles. If a scene has no other types of shapes, then the intersection tests are handled entirely by the GPU’s ray-tracing implementation. For other types of shapes, custom intersection shaders can be provided by the user. The user specifies a shape’s axis-aligned bounding box and associates an intersection shader with it. When a ray intersects that box, the intersection shader is invoked to determine if there is an intersection. If there is, both the parametric value along the ray and a handful of additional user-defined values can be returned to associate with the intersection.

pbrt uses custom intersection shaders for all the quadric shapes as well as for the BilinearPatch. All follow the same pattern. For example, for bilinear patches, the intersection shader calls the previously defined IntersectBilinearPatch() function. Recall that in the event of an intersection, it returns a BilinearIntersection, not a full SurfaceInteraction. This design is intentional, as a BilinearIntersection is much smaller—just 3 Floats. Returning a small representation of the intersection is beneficial for performance, as it reduces how much information must be written to memory.

Alpha testing adds an additional complication to intersection testing (recall the discussion of alpha textures in Section 7.1.1). An any hit shader is applied to any primitives with alpha textures, such as leaves with alpha masks. The any hit shader executes for all intersections with such primitives, before it is known if an intersection is the closest. Our implementation uses the any hit shader to evaluate the alpha texture and apply a stochastic test, just like the <<Possibly ignore intersection based on stochastic alpha test>> fragment in the GeometricPrimitive class used by the CPU. If the test fails, then the GPU is instructed to ignore the intersection completely.

Once ray intersection testing has been completed, one of two shaders is invoked. For rays that have no intersections, a miss shader is called. Otherwise, a closest hit shader is invoked for processing at the intersection point. Now a full SurfaceInteraction is needed. All the shapes that are handled with custom intersection shaders have a method that converts their compact intersection representation to a SurfaceInteraction; for example, it is BilinearPatch::InteractionFromIntersection() for bilinear patches. The GPU reports the barycentric coordinates of triangle intersections, which are sufficient for the Triangle::InteractionFromIntersection() method to do its work.

Given final intersections (or the determination that a ray does not intersect anything), additional work is enqueued using functions from wavefront/intersection.h, as is the case for the wavefront CPU ray-tracing aggregate.

15.3.7 Participating Media

In the interest of space, we will not walk through the code for the kernels launched by the medium sampling method, SampleMediumInteraction(). Algorithmically, it matches the corresponding code in VolPathIntegrator, so we will just summarize the queues and types of work involved.

For any ray with a non-nullptr Medium, the ray intersection code enqueues a MediumSampleWorkItem in the mediumSampleQueue. A first medium-related kernel processes all the entries on this queue. Its task is to call the ray medium’s SampleT_maj() method, adding emission at each sampled point before sampling one of absorption, real scattering, or null scattering. Absorption causes path termination; real scattering causes work to be added to another queue, mediumScatterQueue, that holds MediumScatterWorkItems; and null scattering causes medium sampling to continue. The path throughput and path sampling PDFs are updated along the way. In the end, if the ray is neither scattered nor absorbed, then work is added to queues in the same manner as for rays that are not passing through media in the closest hit shader.

A second kernel is then launched to process all the medium scattering events in the mediumScatterQueue. The usual sampling process ensues: a light and then a point on it are sampled, path sampling PDFs are computed, and a shadow ray is enqueued. Next, the phase function is sampled to generate an indirect ray direction. Work for that ray is then added to the next wavefront depth’s RayQueue via a call to PushIndirectRay().

15.3.8 Ray-Found Emission

Two kernels handle rays that may add emission to their radiance estimates due to their interaction with emissive entities. The first processes rays that have left the scene and the second handles rays that intersect emissive surfaces.

escapedRayQueue is only allocated if the scene has one or more infinite area lights; there is otherwise no work to be done for such rays.

<<WavefrontPathIntegrator Member Variables>>+=

EscapedRayQueue *escapedRayQueue = nullptr;

Note that HandleEscapedRays() returns immediately if there is no queue and thus no infinite area lights, saving the cost of an unnecessary kernel launch in that case.

<<WavefrontPathIntegrator Method Definitions>>+=

void WavefrontPathIntegrator::HandleEscapedRays() { if (!escapedRayQueue) return; ForAllQueued("Handle escaped rays", escapedRayQueue, maxQueueSize, PBRT_CPU_GPU_LAMBDA (const EscapedRayWorkItem w) { <<Compute weighted radiance for escaped ray>>

SampledSpectrum L(0.f); for (const auto &light : *infiniteLights) { if (SampledSpectrum Le = light.Le(Ray(w.rayo, w.rayd), w.lambda); Le) { <<Compute path radiance contribution from infinite light>>

if (w.depth == 0 || w.specularBounce) { L += w.beta * Le / w.r_u.Average(); } else { <<Compute MIS-weighted radiance contribution from infinite light>>

LightSampleContext ctx = w.prevIntrCtx; Float lightChoicePDF = lightSampler.PMF(ctx, light); SampledSpectrum r_l = w.r_l * lightChoicePDF * light.PDF_Li(ctx, w.rayd, true); L += w.beta * Le / (w.r_u + r_l).Average();

}

} }

<<Update pixel radiance if ray’s radiance is nonzero>>

if (L) { L += pixelSampleState.L[w.pixelIndex]; pixelSampleState.L[w.pixelIndex] = L; }

}); }

EscapedRayQueue, not included here, is a WorkQueue of EscapedRayWorkItems. It provides a Push() method that takes a RayWorkItem and copies the values from it that are needed in the kernel.

<<EscapedRayWorkItem Definition>>=

struct EscapedRayWorkItem { <<EscapedRayWorkItem Public Members>>

Point3f rayo; Vector3f rayd; int depth; SampledWavelengths lambda; int pixelIndex; SampledSpectrum beta; int specularBounce; SampledSpectrum r_u, r_l; LightSampleContext prevIntrCtx;

};

The work item for escaped rays stores the ray origin, direction, and depth as well as its wavelengths. The ray’s associated pixel index makes it possible to add any found emission to its radiance estimate.

<<EscapedRayWorkItem Public Members>>=

Point3f rayo; Vector3f rayd; int depth; SampledWavelengths lambda; int pixelIndex;

The kernel’s implementation parallels the <<Accumulate contributions from infinite light sources>> fragment in the VolPathIntegrator, just using the information about the ray from the EscapedRayWorkItem.

<<Compute weighted radiance for escaped ray>>=

SampledSpectrum L(0.f); for (const auto &light : *infiniteLights) { if (SampledSpectrum Le = light.Le(Ray(w.rayo, w.rayd), w.lambda); Le) { <<Compute path radiance contribution from infinite light>>

if (w.depth == 0 || w.specularBounce) { L += w.beta * Le / w.r_u.Average(); } else { <<Compute MIS-weighted radiance contribution from infinite light>>

}

} }

The final result is then added to the ray’s associated PixelSampleState::L value, so long as L is nonzero. If it is zero, skipping the unnecessary update may not lead to fewer instructions being executed, given the GPU’s execution model, but it will save memory bandwidth, which can be just as important to performance.

<<Update pixel radiance if ray’s radiance is nonzero>>=

if (L) { L += pixelSampleState.L[w.pixelIndex]; pixelSampleState.L[w.pixelIndex] = L; }

For infinite area lights that are directly visible or are encountered through specular reflection, MIS is performed using only the unidirectional path PDF, since that is the only way the path could have been sampled.

<<Compute path radiance contribution from infinite light>>=

if (w.depth == 0 || w.specularBounce) { L += w.beta * Le / w.r_u.Average(); } else { <<Compute MIS-weighted radiance contribution from infinite light>>

}

The path throughput beta is needed to weight the light’s contribution. Further, whether or not the previous bounce was due to specular reflection must be tracked. Note that this value only requires a single bit; using a full 32-bit int is wasteful. A more optimized implementation might save some bandwidth by stealing one of the bits from pixelIndex, which does not need all 32 of them.

<<EscapedRayWorkItem Public Members>>+=

SampledSpectrum beta; int specularBounce;

If other types of scattering preceded the ray’s escape, MIS weights are computed using both the light and unidirectional sampling PDFs, following the same approach as is implemented in the <<Add infinite light contribution using both PDFs with MIS>> fragment in the VolPathIntegrator.

<<Compute MIS-weighted radiance contribution from infinite light>>=

In order to compute MIS weights, the EscapedRayWorkItem must provide not only rescaled path probabilities but also geometric information about the previous path scattering vertex.

<<EscapedRayWorkItem Public Members>>+=

SampledSpectrum r_u, r_l; LightSampleContext prevIntrCtx;

The second kernel handles rays that intersect emissive surfaces.

<<WavefrontPathIntegrator Method Definitions>>+=

void WavefrontPathIntegrator::HandleEmissiveIntersection() { ForAllQueued("Handle emitters hit by indirect rays", hitAreaLightQueue, maxQueueSize, PBRT_CPU_GPU_LAMBDA (const HitAreaLightWorkItem w) { <<Find emitted radiance from surface that ray hit>>

SampledSpectrum Le = w.areaLight.L(w.p, w.n, w.uv, w.wo, w.lambda); if (!Le) return;

<<Compute area light’s weighted radiance contribution to the path>>

SampledSpectrum L(0.f); if (w.depth == 0 || w.specularBounce) { L = w.beta * Le / w.r_u.Average(); } else { <<Compute MIS-weighted radiance contribution from area light>>

Vector3f wi = -w.wo; LightSampleContext ctx = w.prevIntrCtx; Float lightChoicePDF = lightSampler.PMF(ctx, w.areaLight); Float lightPDF = lightChoicePDF * w.areaLight.PDF_Li(ctx, wi, true); SampledSpectrum r_u = w.r_u; SampledSpectrum r_l = w.r_l * lightPDF; L = w.beta * Le / (r_u + r_l).Average();

}

<<Update L in PixelSampleState for area light’s radiance>>

L += pixelSampleState.L[w.pixelIndex]; pixelSampleState.L[w.pixelIndex] = L;

}); }

The HitAreaLightQueue stores HitAreaLightWorkItems.

<<HitAreaLightQueue Definition>>=

using HitAreaLightQueue = WorkQueue<HitAreaLightWorkItem>;

<<WavefrontPathIntegrator Member Variables>>+=

HitAreaLightQueue *hitAreaLightQueue = nullptr;

<<HitAreaLightWorkItem Definition>>=

struct HitAreaLightWorkItem { <<HitAreaLightWorkItem Public Members>>

Light areaLight; Point3f p; Normal3f n; Point2f uv; Vector3f wo; SampledWavelengths lambda; int depth; SampledSpectrum beta, r_u, r_l; LightSampleContext prevIntrCtx; int specularBounce; int pixelIndex;

};

The first step in the kernel is to compute the emitted radiance at the ray’s intersection point. If there is none, the kernel can return immediately. Note that it thus could be beneficial if a Light method was added that did a quick conservative test for this case. Such a method could make it possible not to pay the cost of enqueuing work for cases such as a one-sided light source that was intersected on its non-emissive side. With the current implementation, that case is detected only here, costing both bandwidth for the queue work and execution divergence from the threads that return. (Such a method should be simple, deferring more complex tasks like performing texture lookups for surfaces that use image maps to modulate their emission, in order not to harm performance.)

We also note that the call to Light::L() would be a potential source of execution divergence if pbrt had more than one Light implementation that could be used for emissive surfaces. Currently, there is only DiffuseAreaLight, so this is not a concern, but if there were more, it might be worthwhile to have a separate queue for each type of area light in order to avoid this divergence.

<<Find emitted radiance from surface that ray hit>>=

SampledSpectrum Le = w.areaLight.L(w.p, w.n, w.uv, w.wo, w.lambda); if (!Le) return;

The following HitAreaLightWorkItem member variables provide the information necessary to compute the emitted radiance from the intersection point back along the ray.

<<HitAreaLightWorkItem Public Members>>=

Light areaLight; Point3f p; Normal3f n; Point2f uv; Vector3f wo; SampledWavelengths lambda;

The final path contribution is found similarly to how it is for escaped rays and infinite area lights.

<<Compute area light’s weighted radiance contribution to the path>>=

SampledSpectrum L(0.f); if (w.depth == 0 || w.specularBounce) { L = w.beta * Le / w.r_u.Average(); } else { <<Compute MIS-weighted radiance contribution from area light>>

}

Once again, the specularBounce member could be packed in elsewhere in order to save storage and reduce bandwidth requirements.

<<HitAreaLightWorkItem Public Members>>+=

int depth; SampledSpectrum beta, r_u, r_l; LightSampleContext prevIntrCtx; int specularBounce; int pixelIndex;

We will not include the <<Compute MIS-weighted radiance contribution from area light>> fragment here, which closely parallels the corresponding case in the VolPathIntegrator as well as the earlier fragment <<Compute MIS-weighted radiance contribution from infinite light>>.

The final weighted radiance value is then accumulated in the ray’s PixelSampleState.

<<Update L in PixelSampleState for area light’s radiance>>=

L += pixelSampleState.L[w.pixelIndex]; pixelSampleState.L[w.pixelIndex] = L;

The queues for both of these kernels need to be reset at the start of each ray depth iteration, so we will add the corresponding Reset() calls to the queue-resetting fragment defined earlier.

<<Reset queues before tracing next batch of rays>>=

if (escapedRayQueue) escapedRayQueue->Reset(); hitAreaLightQueue->Reset();

Having seen these two kernels, it is fair to ask: why not handle these cases immediately in the ray intersection and medium sampling kernels, rather than incurring the bandwidth costs of queuing up work there and then consuming it here? This is yet another trade-off of bandwidth versus execution convergence. Doing this work in separate kernels for only the cases where it is required means that the kernels both start execution fully converged, with all threads doing useful work. In the intersection and medium scattering kernels, we would generally expect that only a subset of the rays would leave the scene or intersect emissive surfaces. In that case, we would have control divergence and all rays in the thread group that did not intersect an emissive surface would incur a performance cost if even one of the others did. The optimal trade-off depends on both the complexity of the computation to be done in those cases and the amount of bandwidth offered by the GPU.

15.3.9 Surface Scattering

With the WavefrontPathIntegrator, the majority of rendering time is usually spent in the kernels responsible for surface scattering. These kernels are specialized by the surfaces’ materials and, in turn, the type of BxDF that each material uses for the BSDF it returns. Starting from a ray–shape intersection, a surface-scattering kernel handles everything from normal and bump mapping to material evaluation, light and BSDF sampling, and queuing up shadow and indirect rays for later processing.

All of this starts with the Render() method calling EvaluateMaterialsAndBSDFs(), which is implemented in wavefront/surfscatter.cpp. With this method’s implementation, we encounter a new idiom that orchestrates the kernel launches: a call to ForEachType(). In its use here, that function iterates over all of the types that a Material could be and calls a callback function for each one of them. Thus, if pbrt is extended with an additional material, adding that one to the list of materials that Material passes to the TaggedPointer that it inherits from in its declaration in base/material.h automatically leads to a specialized kernel for that material being generated here.

<<WavefrontPathIntegrator Surface Scattering Methods>>=

void WavefrontPathIntegrator::EvaluateMaterialsAndBSDFs( int wavefrontDepth) { ForEachType(EvaluateMaterialCallback{wavefrontDepth, this}, Material::Types()); }

A lambda function is not sufficient to pass to ForEachType(), as it does not pass a value of the given type to the callback but instead invokes its function call operator, passing the type for use in a template specialization. Therefore, we wrap the values needed for the material evaluation method call in a small structure.

<<EvaluateMaterialCallback Definition>>=

struct EvaluateMaterialCallback { int wavefrontDepth; WavefrontPathIntegrator *integrator; <<EvaluateMaterialCallback Public Methods>>

template <typename ConcreteMaterial> void operator()() { if constexpr (!std::is_same_v<ConcreteMaterial, MixMaterial>) integrator->EvaluateMaterialAndBSDF<ConcreteMaterial>(wavefrontDepth); }

};

ForEachType() invokes the following method for each type of material. We skip over the MixMaterial here, since all instances of it are resolved to one of the other material types before being enqueued for the surface-shading kernels. (See Section 10.5.1 for a discussion of this detail of MixMaterial’s usage.)

<<EvaluateMaterialCallback Public Methods>>=

template <typename ConcreteMaterial> void operator()() { if constexpr (!std::is_same_v<ConcreteMaterial, MixMaterial>) integrator->EvaluateMaterialAndBSDF<ConcreteMaterial>(wavefrontDepth); }

The EvaluateMaterialAndBSDF() method does not yet bring us to the point of launching kernels—two considerations are handled beforehand. First, the implementation skips launching the specialized EvaluateMaterialAndBSDF() method for any material types that are not present in the scene. Such work queues will have no entries, so there is no reason to bother with them. This is handled using two arrays, haveBasicEvalMaterial and haveUniversalEvalMaterial, that are initialized in the WavefrontPathIntegrator constructor. (The two further distinguish the materials based on the complexity of their textures, which is a topic that will be discussed immediately after the following fragment.) Both arrays are indexed using the TaggedPointer TypeIndex() method, which returns an integer index for each representable type.

<<WavefrontPathIntegrator Surface Scattering Methods>>+=

template <typename ConcreteMaterial> void WavefrontPathIntegrator::EvaluateMaterialAndBSDF(int wavefrontDepth) { int index = Material::TypeIndex<ConcreteMaterial>(); if (haveBasicEvalMaterial[index]) EvaluateMaterialAndBSDF<ConcreteMaterial, BasicTextureEvaluator>( basicEvalMaterialQueue, wavefrontDepth); if (haveUniversalEvalMaterial[index]) EvaluateMaterialAndBSDF<ConcreteMaterial, UniversalTextureEvaluator>( universalEvalMaterialQueue, wavefrontDepth); }

The WavefrontPathIntegrator maintains two work queues for materials, partitioning them based on the complexity of their textures. Each queue is a MultiWorkQueue with one entry for each material type.

<<WavefrontPathIntegrator Member Variables>>+=

MaterialEvalQueue *basicEvalMaterialQueue = nullptr; MaterialEvalQueue *universalEvalMaterialQueue = nullptr;

These give two more queues to add to the ones that are reset at the start of tracing rays at each wavefront depth.

<<Reset queues before tracing next batch of rays>>+=

basicEvalMaterialQueue->Reset(); universalEvalMaterialQueue->Reset();

Before continuing into the EvaluateMaterialAndBSDF() template specialization, we will detour to discuss texture evaluation in the wavefront integrator in more detail. Recall that when the Material interface was introduced in Section 10.5, it included the notion of a TextureEvaluator. Methods like GetBxDF() and GetBSSRDF() were templated on this type, took an instance of it as a parameter, and used it to evaluate textures rather than calling their Evaluate() methods directly.

There was no point in doing that for CPU rendering: there, a UniversalTextureEvaluator is always used. It immediately forwards texture evaluation requests on to the textures. pbrt’s full set of textures spans a wide range of complexity, however, ranging from trivial constant textures that return a fixed value to complex textures that evaluate noise functions to ones like SpectrumMixTexture that themselves recursively evaluate other textures.

Not only do more complex textures require more registers on the GPU, but the potential for unbounded recursion from the mixture textures requires that the compiler allocate resources to be prepared for that case. In turn, the performance of evaluating the simpler textures can be harmed due to choices the compiler has made for the more complex ones. Because the simpler textures are common, it is thus worthwhile to separate materials according to the complexity of their textures and to have separate kernels for the materials that only use the simpler ones. Doing so can give further benefits from reducing control flow divergence. The following BasicTextureEvaluator class helps with this task.

<<BasicTextureEvaluator Definition>>=

class BasicTextureEvaluator { public: <<BasicTextureEvaluator Public Methods>>

bool CanEvaluate(std::initializer_list<FloatTexture> ftex, std::initializer_list<SpectrumTexture> stex) const { <<Return false if any FloatTextures cannot be evaluated>>

for (FloatTexture f : ftex) if (f && !f.Is<FloatConstantTexture>() && !f.Is<FloatImageTexture>() && !f.Is<GPUFloatImageTexture>()) return false;

<<Return false if any SpectrumTextures cannot be evaluated>>

for (SpectrumTexture s : stex) if (s && !s.Is<SpectrumConstantTexture>() && !s.Is<SpectrumImageTexture>() && !s.Is<GPUSpectrumImageTexture>()) return false;

return true; } Float operator()(FloatTexture tex, TextureEvalContext ctx) { if (tex.Is<FloatConstantTexture>()) return tex.Cast<FloatConstantTexture>()->Evaluate(ctx); else if (tex.Is<FloatImageTexture>()) return tex.Cast<FloatImageTexture>()->Evaluate(ctx); else if (tex.Is<GPUFloatImageTexture>()) return tex.Cast<GPUFloatImageTexture>()->Evaluate(ctx); else return 0.f; } SampledSpectrum operator()(SpectrumTexture tex, TextureEvalContext ctx, SampledWavelengths lambda) { if (tex.Is<SpectrumConstantTexture>()) return tex.Cast<SpectrumConstantTexture>()->Evaluate(ctx, lambda); else if (tex.Is<SpectrumImageTexture>()) return tex.Cast<SpectrumImageTexture>()->Evaluate(ctx, lambda); else if (tex.Is<GPUSpectrumImageTexture>()) return tex.Cast<GPUSpectrumImageTexture>()->Evaluate(ctx, lambda); else return SampledSpectrum(0.f); }

};

We will categorize constant textures and plain image map textures as “basic.” Thus, the BasicTextureEvaluator’s CanEvaluate() method iterates over all provided textures and checks that each is one of those types. When material evaluation work is to be enqueued in the EnqueueWorkAfterIntersection() function, the material’s Material::CanEvaluateTextures() method is called with a BasicTextureEvaluator to determine whether the work is valid for the basic material evaluation queues. If not, it goes on the appropriate universalEvalMaterialQueue.

<<BasicTextureEvaluator Public Methods>>=

bool CanEvaluate(std::initializer_list<FloatTexture> ftex, std::initializer_list<SpectrumTexture> stex) const { <<Return false if any FloatTextures cannot be evaluated>>

for (FloatTexture f : ftex) if (f && !f.Is<FloatConstantTexture>() && !f.Is<FloatImageTexture>() && !f.Is<GPUFloatImageTexture>()) return false;

<<Return false if any SpectrumTextures cannot be evaluated>>

for (SpectrumTexture s : stex) if (s && !s.Is<SpectrumConstantTexture>() && !s.Is<SpectrumImageTexture>() && !s.Is<GPUSpectrumImageTexture>()) return false;

return true; }

The texture types are easily checked with the TaggedPointer::Is() method. (The scene initialization code creates instances of GPUFloatImageTexture in place of FloatImageTextures when GPU rendering is being used. That class uses platform-specific functionality to perform filtered texture look-ups more efficiently than executing Image methods would. GPUSpectrumImageTexture follows equivalently.)

<<Return false if any FloatTextures cannot be evaluated>>=

for (FloatTexture f : ftex) if (f && !f.Is<FloatConstantTexture>() && !f.Is<FloatImageTexture>() && !f.Is<GPUFloatImageTexture>()) return false;

The corresponding fragment for SpectrumTextures is equivalent and is therefore not included here.

The implementation of the TextureEvaluator evaluation method is key to the efficiency benefits provided by this approach. It would do no good to sort materials based on their textures but to then use pbrt’s regular dynamic dispatch mechanism for texture evaluation: in that case, the compiler would have no insight into the fact that the texture being passed to it must be one of the simple types accepted by CanEvaluate().

Therefore, the evaluation method instead tries casting the texture to each of the supported types until it finds the correct one. It then calls the corresponding evaluation method directly. In this way, the compiler can easily tell that the only texture evaluation methods that might be called are those for FloatConstantTexture, FloatImageTexture, and GPUFloatImageTexture, and it does not need to worry about resource allocation for evaluating the more complex texture types.

<<BasicTextureEvaluator Public Methods>>+=

Float operator()(FloatTexture tex, TextureEvalContext ctx) { if (tex.Is<FloatConstantTexture>()) return tex.Cast<FloatConstantTexture>()->Evaluate(ctx); else if (tex.Is<FloatImageTexture>()) return tex.Cast<FloatImageTexture>()->Evaluate(ctx); else if (tex.Is<GPUFloatImageTexture>()) return tex.Cast<GPUFloatImageTexture>()->Evaluate(ctx); else return 0.f; }

The evaluation method for spectrum textures is similar and therefore not included here.

With the motivation for the BasicTextureEvaluator explained, it is possible to better understand why ImageTextureBase in Section 10.4.1 offers scale and invert parameters even though scale and mix textures could be used to achieve the same results. With that capability directly available in textures that can be evaluated by the BasicTextureEvaluator, the UniversalTextureEvaluator can be invoked less often. pbrt actually has a texture rewriting pass that, for example, converts a scale texture with a constant scale applied to an image texture to just an image texture, configured to apply that scale itself. (See, for example, the SpectrumScaledTexture::Create() method implementation for details.)

The following definition of the MaterialEvalQueue type is admittedly complex. However, it is another key to pbrt’s extensibility. For material and BSDF evaluation, we would like to have a MultiWorkQueue where there is a separate queue for each type of Material that pbrt supports, where the type of the work items is the template class MaterialEvalWorkItem, specialized with each material type. While we could enumerate all the currently supported materials in template parameters, to do so would mean that adding a new material to the system would require editing an obscure part of the wavefront integrator implementation for it to be available there as well.

Therefore, we go through some gymnastics in order to define the type of the MultiWorkQueue automatically. First, the TaggedPointer type that Material inherits from includes a type declaration, Types, that holds a TypePack of all types that the pointer can represent. We then use the MapType functionality from Section B.4.3 to wrap each material type inside the forthcoming MaterialEvalWorkItem template class. This gives us the final TypePack of types to provide to MultiWorkQueue.

<<MaterialEvalQueue Definition>>=

using MaterialEvalQueue = MultiWorkQueue<typename MapType<MaterialEvalWorkItem, typename Material::Types>::type>;

<<MaterialEvalWorkItem Definition>>=

template <typename ConcreteMaterial> struct MaterialEvalWorkItem { <<MaterialEvalWorkItem Public Methods>>

PBRT_CPU_GPU NormalBumpEvalContext GetNormalBumpEvalContext(Float dudx, Float dudy, Float dvdx, Float dvdy) const { NormalBumpEvalContext ctx; ctx.p = Point3f(pi); ctx.uv = uv; ctx.dudx = dudx; ctx.dudy = dudy; ctx.dvdx = dvdx; ctx.dvdy = dvdy; ctx.shading.n = ns; ctx.shading.dpdu = dpdus; ctx.shading.dpdv = dpdvs; ctx.shading.dndu = dndus; ctx.shading.dndv = dndvs; ctx.faceIndex = faceIndex; return ctx; } PBRT_CPU_GPU MaterialEvalContext GetMaterialEvalContext(Float dudx, Float dudy, Float dvdx, Float dvdy, Normal3f ns, Vector3f dpdus) const { MaterialEvalContext ctx; ctx.wo = wo; ctx.n = n; ctx.ns = ns; ctx.dpdus = dpdus; ctx.p = Point3f(pi); ctx.uv = uv; ctx.dudx = dudx; ctx.dudy = dudy; ctx.dvdx = dvdx; ctx.dvdy = dvdy; ctx.faceIndex = faceIndex; return ctx; }

<<MaterialEvalWorkItem Public Members>>

const ConcreteMaterial *material; Point3fi pi; Normal3f n; Vector3f dpdu, dpdv; Float time; int depth; Normal3f ns; Vector3f dpdus, dpdvs; Normal3f dndus, dndvs; Point2f uv; SampledWavelengths lambda; int pixelIndex; int anyNonSpecularBounces; Vector3f wo; SampledSpectrum beta, r_u; Float etaScale; MediumInterface mediumInterface;

};

It is crucial to have a pointer to the material in MaterialEvalWorkItem. Note that this can be a pointer to a concrete material type such as DiffuseMaterial due to MaterialEvalWorkItem being a template class on the ConcreteMaterial type. (We will introduce the rest of the member variables of the MaterialEvalWorkItem as they are used in code in the remainder of the section.)

<<MaterialEvalWorkItem Public Members>>=

const ConcreteMaterial *material;

With this context in hand, we can proceed to the implementation of the EvaluateMaterialAndBSDF() method. It is parameterized both by the concrete type of material that is being evaluated and by a TextureEvaluator, which allows us to generate specializations based not only on material but also on the two types of TextureEvaluator.

<<WavefrontPathIntegrator Surface Scattering Methods>>+=

template <typename ConcreteMaterial, typename TextureEvaluator> void WavefrontPathIntegrator::EvaluateMaterialAndBSDF( MaterialEvalQueue *evalQueue, int wavefrontDepth) { <<Get BSDF for items in evalQueue and sample illumination>>

<<Construct desc for material/texture evaluation kernel>>

std::string desc = StringPrintf( "%s + BxDF eval (%s tex)", ConcreteMaterial::Name(), std::is_same_v<TextureEvaluator, BasicTextureEvaluator> ? "Basic" : "Universal");

RayQueue *nextRayQueue = NextRayQueue(wavefrontDepth); auto queue = evalQueue->Get<MaterialEvalWorkItem<ConcreteMaterial>>(); ForAllQueued(desc.c_str(), queue, maxQueueSize, PBRT_CPU_GPU_LAMBDA (const MaterialEvalWorkItem<ConcreteMaterial> w) { <<Evaluate material and BSDF for ray intersection>>

TextureEvaluator texEval; <<Compute differentials for position and at intersection point>>

Vector3f dpdx, dpdy; Float dudx = 0, dudy = 0, dvdx = 0, dvdy = 0; camera.Approximate_dp_dxy(Point3f(w.pi), w.n, w.time, samplesPerPixel, &dpdx, &dpdy); Vector3f dpdu = w.dpdu, dpdv = w.dpdv; <<Estimate screen-space change in >>

<<Compute and its determinant>>

Float ata00 = Dot(dpdu, dpdu), ata01 = Dot(dpdu, dpdv); Float ata11 = Dot(dpdv, dpdv); Float invDet = 1 / DifferenceOfProducts(ata00, ata11, ata01, ata01); invDet = IsFinite(invDet) ? invDet : 0.f;

<<Compute for and >>

Float atb0x = Dot(dpdu, dpdx), atb1x = Dot(dpdv, dpdx); Float atb0y = Dot(dpdu, dpdy), atb1y = Dot(dpdv, dpdy);

<<Compute and derivatives with respect to and >>

dudx = DifferenceOfProducts(ata11, atb0x, ata01, atb1x) * invDet; dvdx = DifferenceOfProducts(ata00, atb1x, ata01, atb0x) * invDet; dudy = DifferenceOfProducts(ata11, atb0y, ata01, atb1y) * invDet; dvdy = DifferenceOfProducts(ata00, atb1y, ata01, atb0y) * invDet;

<<Clamp derivatives of and to reasonable values>>

dudx = IsFinite(dudx) ? Clamp(dudx, -1e8f, 1e8f) : 0.f; dvdx = IsFinite(dvdx) ? Clamp(dvdx, -1e8f, 1e8f) : 0.f; dudy = IsFinite(dudy) ? Clamp(dudy, -1e8f, 1e8f) : 0.f; dvdy = IsFinite(dvdy) ? Clamp(dvdy, -1e8f, 1e8f) : 0.f;

<<Compute shading normal if bump or normal mapping is being used>>

Normal3f ns = w.ns; Vector3f dpdus = w.dpdus; FloatTexture displacement = w.material->GetDisplacement(); const Image *normalMap = w.material->GetNormalMap(); if (normalMap) { <<Call NormalMap() to find shading geometry>>

NormalBumpEvalContext bctx = w.GetNormalBumpEvalContext(dudx, dudy, dvdx, dvdy); Vector3f dpdvs; NormalMap(*normalMap, bctx, &dpdus, &dpdvs); ns = Normal3f(Normalize(Cross(dpdus, dpdvs))); ns = FaceForward(ns, w.n);

} else if (displacement) { <<Call BumpMap() to find shading geometry>>

NormalBumpEvalContext bctx = w.GetNormalBumpEvalContext(dudx, dudy, dvdx, dvdy); Vector3f dpdvs; BumpMap(texEval, displacement, bctx, &dpdus, &dpdvs); ns = Normal3f(Normalize(Cross(dpdus, dpdvs))); ns = FaceForward(ns, w.n);

}

<<Get BSDF at intersection point>>

SampledWavelengths lambda = w.lambda; MaterialEvalContext ctx = w.GetMaterialEvalContext(dudx, dudy, dvdx, dvdy, ns, dpdus); using ConcreteBxDF = typename ConcreteMaterial::BxDF; ConcreteBxDF bxdf = w.material->GetBxDF(texEval, ctx, lambda); BSDF bsdf(ctx.ns, ctx.dpdus, &bxdf); <<Handle terminated secondary wavelengths after BSDF creation>>

if (lambda.SecondaryTerminated()) pixelSampleState.lambda[w.pixelIndex] = lambda;

<<Regularize BSDF, if appropriate>>

if (regularize && w.anyNonSpecularBounces) bsdf.Regularize();

<<Initialize VisibleSurface at first intersection if necessary>>

if (w.depth == 0 && initializeVisibleSurface) { SurfaceInteraction isect; isect.pi = w.pi; isect.n = w.n; isect.shading.n = ns; isect.uv = w.uv; isect.wo = w.wo; isect.time = w.time; isect.dpdx = dpdx; isect.dpdy = dpdy; <<Estimate BSDF’s albedo>>

<<Define sample arrays ucRho and uRho for reflectance estimate>>

constexpr int nRhoSamples = 16; const Float ucRho[nRhoSamples] = { 0.75741637, 0.37870818, 0.7083487, 0.18935409, 0.9149363, 0.35417435, 0.5990858, 0.09467703, 0.8578725, 0.45746812, 0.686759, 0.17708716, 0.9674518, 0.2995429, 0.5083201, 0.047338516 }; const Point2f uRho[nRhoSamples] = { Point2f(0.855985, 0.570367), Point2f(0.381823, 0.851844), Point2f(0.285328, 0.764262), Point2f(0.733380, 0.114073), Point2f(0.542663, 0.344465), Point2f(0.127274, 0.414848), Point2f(0.964700, 0.947162), Point2f(0.594089, 0.643463), Point2f(0.095109, 0.170369), Point2f(0.825444, 0.263359), Point2f(0.429467, 0.454469), Point2f(0.244460, 0.816459), Point2f(0.756135, 0.731258), Point2f(0.516165, 0.152852), Point2f(0.180888, 0.214174), Point2f(0.898579, 0.503897) };

SampledSpectrum albedo = bsdf.rho(isect.wo, ucRho, uRho);

pixelSampleState.visibleSurface[w.pixelIndex] = VisibleSurface(isect, albedo, lambda); }

<<Sample BSDF and enqueue indirect ray at intersection point>>

Vector3f wo = w.wo; RaySamples raySamples = pixelSampleState.samples[w.pixelIndex]; pstd::optional<BSDFSample> bsdfSample = bsdf.Sample_f<ConcreteBxDF>(wo, raySamples.indirect.uc, raySamples.indirect.u); if (bsdfSample) { <<Compute updated path throughput and PDFs and enqueue indirect ray>>

Vector3f wi = bsdfSample->wi; SampledSpectrum beta = w.beta * bsdfSample->f * AbsDot(wi, ns)/bsdfSample->pdf; SampledSpectrum r_u = w.r_u, r_l; <<Update r_u based on BSDF sample PDF>>

if (bsdfSample->pdfIsProportional) r_l = r_u / bsdf.PDF<ConcreteBxDF>(wo, bsdfSample->wi); else r_l = r_u / bsdfSample->pdf;

<<Update etaScale accounting for BSDF scattering>>

Float etaScale = w.etaScale; if (bsdfSample->IsTransmission()) etaScale *= Sqr(bsdfSample->eta);

<<Apply Russian roulette to indirect ray based on weighted path throughput>>

SampledSpectrum rrBeta = beta * etaScale / r_u.Average(); // Note: depth >= 1 here to match VolPathIntegrator (which increments depth earlier). if (rrBeta.MaxComponentValue() < 1 && w.depth >= 1) { Float q = std::max<Float>(0, 1 - rrBeta.MaxComponentValue()); if (raySamples.indirect.rr < q) { beta = SampledSpectrum(0.f); PBRT_DBG("Path terminated with RR\n"); } else beta /= 1 - q; }

if (beta) { <<Initialize spawned ray and enqueue for next ray depth>>

Ray ray = SpawnRay(w.pi, w.n, w.time, wi); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

bool anyNonSpecularBounces = !bsdfSample->IsSpecular() || w.anyNonSpecularBounces; LightSampleContext ctx(w.pi, w.n, ns); nextRayQueue->PushIndirectRay( ray, w.depth + 1, ctx, beta, r_u, r_l, lambda, etaScale, bsdfSample->IsSpecular(), anyNonSpecularBounces, w.pixelIndex);

}

<<Sample light and enqueue shadow ray at intersection point>>

BxDFFlags flags = bsdf.Flags(); if (IsNonSpecular(flags)) { <<Choose a light source using the LightSampler>>

LightSampleContext ctx(w.pi, w.n, ns); if (IsReflective(flags) && !IsTransmissive(flags)) ctx.pi = OffsetRayOrigin(ctx.pi, w.n, wo); else if (IsTransmissive(flags) && IsReflective(flags)) ctx.pi = OffsetRayOrigin(ctx.pi, w.n, -wo); pstd::optional<SampledLight> sampledLight = lightSampler.Sample(ctx, raySamples.direct.uc); if (!sampledLight) return; Light light = sampledLight->light;

<<Sample light source and evaluate BSDF for direct lighting>>

pstd::optional<LightLiSample> ls = light.SampleLi(ctx, raySamples.direct.u, lambda, true); if (!ls || !ls->L || ls->pdf == 0) return; Vector3f wi = ls->wi; SampledSpectrum f = bsdf.f<ConcreteBxDF>(wo, wi); if (!f) return;

<<Compute path throughput and path PDFs for light sample>>

SampledSpectrum beta = w.beta * f * AbsDot(wi, ns); Float lightPDF = ls->pdf * sampledLight->p; Float bsdfPDF = IsDeltaLight(light.Type()) ? 0.f : bsdf.PDF<ConcreteBxDF>(wo, wi); SampledSpectrum r_u = w.r_u * bsdfPDF; SampledSpectrum r_l = w.r_u * lightPDF;

<<Enqueue shadow ray with tentative radiance contribution>>

SampledSpectrum Ld = beta * ls->L; Ray ray = SpawnRayTo(w.pi, w.n, w.time, ls->pLight.pi, ls->pLight.n); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

shadowRayQueue->Push(ShadowRayWorkItem{ray, 1 - ShadowEpsilon, lambda, Ld, r_u, r_l, w.pixelIndex});

}

});

}

After some initialization work that includes getting a pointer to the WorkQueue for material evaluation of ConcreteMaterial, the work items in the queue are processed in parallel.

<<Get BSDF for items in evalQueue and sample illumination>>=

<<Construct desc for material/texture evaluation kernel>>

std::string desc = StringPrintf( "%s + BxDF eval (%s tex)", ConcreteMaterial::Name(), std::is_same_v<TextureEvaluator, BasicTextureEvaluator> ? "Basic" : "Universal");

TextureEvaluator texEval; <<Compute differentials for position and at intersection point>>

<<Compute and its determinant>>

Float ata00 = Dot(dpdu, dpdu), ata01 = Dot(dpdu, dpdv); Float ata11 = Dot(dpdv, dpdv); Float invDet = 1 / DifferenceOfProducts(ata00, ata11, ata01, ata01); invDet = IsFinite(invDet) ? invDet : 0.f;

<<Compute for and >>

Float atb0x = Dot(dpdu, dpdx), atb1x = Dot(dpdv, dpdx); Float atb0y = Dot(dpdu, dpdy), atb1y = Dot(dpdv, dpdy);

<<Compute and derivatives with respect to and >>

<<Clamp derivatives of and to reasonable values>>

<<Compute shading normal if bump or normal mapping is being used>>

} else if (displacement) { <<Call BumpMap() to find shading geometry>>

}

<<Get BSDF at intersection point>>

if (lambda.SecondaryTerminated()) pixelSampleState.lambda[w.pixelIndex] = lambda;

<<Regularize BSDF, if appropriate>>

if (regularize && w.anyNonSpecularBounces) bsdf.Regularize();

<<Initialize VisibleSurface at first intersection if necessary>>

<<Define sample arrays ucRho and uRho for reflectance estimate>>

SampledSpectrum albedo = bsdf.rho(isect.wo, ucRho, uRho);

pixelSampleState.visibleSurface[w.pixelIndex] = VisibleSurface(isect, albedo, lambda); }

<<Sample BSDF and enqueue indirect ray at intersection point>>

Vector3f wi = bsdfSample->wi; SampledSpectrum beta = w.beta * bsdfSample->f * AbsDot(wi, ns)/bsdfSample->pdf; SampledSpectrum r_u = w.r_u, r_l; <<Update r_u based on BSDF sample PDF>>

if (bsdfSample->pdfIsProportional) r_l = r_u / bsdf.PDF<ConcreteBxDF>(wo, bsdfSample->wi); else r_l = r_u / bsdfSample->pdf;

<<Update etaScale accounting for BSDF scattering>>

Float etaScale = w.etaScale; if (bsdfSample->IsTransmission()) etaScale *= Sqr(bsdfSample->eta);

<<Apply Russian roulette to indirect ray based on weighted path throughput>>

if (beta) { <<Initialize spawned ray and enqueue for next ray depth>>

Ray ray = SpawnRay(w.pi, w.n, w.time, wi); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

}

<<Sample light and enqueue shadow ray at intersection point>>

BxDFFlags flags = bsdf.Flags(); if (IsNonSpecular(flags)) { <<Choose a light source using the LightSampler>>

<<Sample light source and evaluate BSDF for direct lighting>>

<<Compute path throughput and path PDFs for light sample>>

<<Enqueue shadow ray with tentative radiance contribution>>

SampledSpectrum Ld = beta * ls->L; Ray ray = SpawnRayTo(w.pi, w.n, w.time, ls->pLight.pi, ls->pLight.n); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

shadowRayQueue->Push(ShadowRayWorkItem{ray, 1 - ShadowEpsilon, lambda, Ld, r_u, r_l, w.pixelIndex});

}

});

The structure of the material evaluation and sampling kernel parallels that of the surface-scattering–focused part of VolPathIntegrator::Li().

<<Evaluate material and BSDF for ray intersection>>=

TextureEvaluator texEval; <<Compute differentials for position and at intersection point>>

<<Compute and its determinant>>

Float ata00 = Dot(dpdu, dpdu), ata01 = Dot(dpdu, dpdv); Float ata11 = Dot(dpdv, dpdv); Float invDet = 1 / DifferenceOfProducts(ata00, ata11, ata01, ata01); invDet = IsFinite(invDet) ? invDet : 0.f;

<<Compute for and >>

Float atb0x = Dot(dpdu, dpdx), atb1x = Dot(dpdv, dpdx); Float atb0y = Dot(dpdu, dpdy), atb1y = Dot(dpdv, dpdy);

<<Compute and derivatives with respect to and >>

<<Clamp derivatives of and to reasonable values>>

<<Compute shading normal if bump or normal mapping is being used>>

} else if (displacement) { <<Call BumpMap() to find shading geometry>>

}

<<Get BSDF at intersection point>>

if (lambda.SecondaryTerminated()) pixelSampleState.lambda[w.pixelIndex] = lambda;

<<Regularize BSDF, if appropriate>>

if (regularize && w.anyNonSpecularBounces) bsdf.Regularize();

<<Initialize VisibleSurface at first intersection if necessary>>

<<Define sample arrays ucRho and uRho for reflectance estimate>>

SampledSpectrum albedo = bsdf.rho(isect.wo, ucRho, uRho);

pixelSampleState.visibleSurface[w.pixelIndex] = VisibleSurface(isect, albedo, lambda); }

<<Sample BSDF and enqueue indirect ray at intersection point>>

Vector3f wi = bsdfSample->wi; SampledSpectrum beta = w.beta * bsdfSample->f * AbsDot(wi, ns)/bsdfSample->pdf; SampledSpectrum r_u = w.r_u, r_l; <<Update r_u based on BSDF sample PDF>>

if (bsdfSample->pdfIsProportional) r_l = r_u / bsdf.PDF<ConcreteBxDF>(wo, bsdfSample->wi); else r_l = r_u / bsdfSample->pdf;

<<Update etaScale accounting for BSDF scattering>>

Float etaScale = w.etaScale; if (bsdfSample->IsTransmission()) etaScale *= Sqr(bsdfSample->eta);

<<Apply Russian roulette to indirect ray based on weighted path throughput>>

if (beta) { <<Initialize spawned ray and enqueue for next ray depth>>

Ray ray = SpawnRay(w.pi, w.n, w.time, wi); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

}

<<Sample light and enqueue shadow ray at intersection point>>

BxDFFlags flags = bsdf.Flags(); if (IsNonSpecular(flags)) { <<Choose a light source using the LightSampler>>

<<Sample light source and evaluate BSDF for direct lighting>>

<<Compute path throughput and path PDFs for light sample>>

<<Enqueue shadow ray with tentative radiance contribution>>

SampledSpectrum Ld = beta * ls->L; Ray ray = SpawnRayTo(w.pi, w.n, w.time, ls->pLight.pi, ls->pLight.n); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

shadowRayQueue->Push(ShadowRayWorkItem{ray, 1 - ShadowEpsilon, lambda, Ld, r_u, r_l, w.pixelIndex});

}

One difference from the VolPathIntegrator is that the wavefront integrator does not carry ray differentials with the rays it traces. Therefore, in order to compute filter widths for texture filtering, the Camera’s Approximate_dp_dxy() method is used to compute approximate differentials at the intersection point. While the resulting filter width estimates are less accurate if there has been specular reflection or transmission from curved surfaces, they usually work well in practice. Note that by declaring local variables dpdu and dpdv at the end of this fragment that store the surface’s partial derivatives, we are able to reuse the earlier fragment <<Estimate screen-space change in >> to compute the texture derivatives.

<<Compute differentials for position and at intersection point>>=

<<Compute and its determinant>>

Float ata00 = Dot(dpdu, dpdu), ata01 = Dot(dpdu, dpdv); Float ata11 = Dot(dpdv, dpdv); Float invDet = 1 / DifferenceOfProducts(ata00, ata11, ata01, ata01); invDet = IsFinite(invDet) ? invDet : 0.f;

<<Compute for and >>

Float atb0x = Dot(dpdu, dpdx), atb1x = Dot(dpdv, dpdx); Float atb0y = Dot(dpdu, dpdy), atb1y = Dot(dpdv, dpdy);

<<Compute and derivatives with respect to and >>

<<Clamp derivatives of and to reasonable values>>

MaterialEvalWorkItem also carries a variety of information about the local geometry of the ray–shape intersection that is copied from the SurfaceInteraction when material evaluation work is enqueued.

<<MaterialEvalWorkItem Public Members>>+=

Point3fi pi; Normal3f n; Vector3f dpdu, dpdv; Float time; int depth;

Before any further work is done, the effect of normal or bump mapping on the local shading geometry is found, if appropriate.

<<Compute shading normal if bump or normal mapping is being used>>=

} else if (displacement) { <<Call BumpMap() to find shading geometry>>

}

MaterialEvalWorkItem carries along information about the initial shading geometry from the intersected shape for a starting point for calls to NormalMap() or BumpMap() and for direct use if there is no normal or bump mapping at the intersection point.

<<MaterialEvalWorkItem Public Members>>+=

Normal3f ns; Vector3f dpdus, dpdvs; Normal3f dndus, dndvs; Point2f uv;

As before, all work related to normal mapping is performed in the NormalMap() function and bump mapping is handled by BumpMap(). Note that here we are able to avoid dynamic dispatch in the calling of the material’s GetDisplacement() and GetNormalMap() methods thanks to ConcreteMaterial being a known specific material type. If a displacement texture or normal map is present, the NormalBumpEvalContext for the call to BumpMap() comes from a MaterialEvalWorkItem utility function that we do not include here; it just initializes a NormalBumpEvalContext using appropriate values from its member variables.

On the GPU, there are a few ways that there may be execution divergence here. First, if some of the threads in the thread group have normal maps or displacement textures and some do not, then all of them will pay the cost of executing the NormalMap() and BumpMap() functions. Second, some threads may have displacement textures and others may have normal maps, which will lead to execution divergence. Finally, the types of the FloatTextures used for defining displacements may vary across the threads. Each of these factors could be used to sort work more finely, though we have not found this divergence to be too much of a problem in practice.

<<Call NormalMap() to find shading geometry>>=

The fragment that calls BumpMap() for bump mapping parallels the one for normal mapping, so it is not included here.

The BSDF is initialized here in a different way than in the VolPathIntegrator. Because the EvaluateMaterialAndBSDF() method is specialized on the material type, and because Materials in pbrt must provide a type definition for their associated BxDF, it is possible to stack-allocate the BxDF here rather than use an instance of the ScratchBuffer class to allocate space for it. It is then possible to call the material’s GetBxDF() method directly rather than using dynamic dispatch through a call to Material::GetBSDF() to do so.

The benefit from avoiding use of the ScratchBuffer for BxDFs is significant: just as stack-allocating the concrete Sampler type in the camera ray generation and sample generation kernels made it possible for the Sampler’s member variables to be stored in GPU registers, this approach does the same for the BxDF, giving a substantial performance benefit compared to storing them in device memory.

The MaterialEvalWorkItem GetMaterialEvalContext() method, which is also not included here, initializes a MaterialEvalContext from its corresponding member variables.

<<Get BSDF at intersection point>>=

if (lambda.SecondaryTerminated()) pixelSampleState.lambda[w.pixelIndex] = lambda;

<<MaterialEvalWorkItem Public Members>>+=

SampledWavelengths lambda;

The call to GetBxDF() above may cause secondary wavelengths to be terminated—for example, in case of dispersion. It is therefore important that the stack-allocated lambda value both be used here in this kernel and also be passed along to subsequent rendering kernels, rather than the initial SampledWavelengths value from the MaterialEvalWorkItem. If the secondary wavelengths in lambda have been terminated, the copy of the SampledWavelengths for the path in PixelSampleState must be updated. Note that the implementation here may end up writing lambda multiple times redundantly at subsequent intersections in that case.

<<Handle terminated secondary wavelengths after BSDF creation>>=

if (lambda.SecondaryTerminated()) pixelSampleState.lambda[w.pixelIndex] = lambda;

<<MaterialEvalWorkItem Public Members>>+=

int pixelIndex;

BSDF regularization proceeds as before, only happening if the option has been enabled and a ray has undergone a non-specular scattering event.

<<Regularize BSDF, if appropriate>>=

if (regularize && w.anyNonSpecularBounces) bsdf.Regularize();

The anyNonSpecularBounces member variable is yet another single-bit quantity taking up 32 bits. A more bandwidth-efficient implementation would pack this value into a free bit elsewhere in the MaterialEvalWorkItem.

<<MaterialEvalWorkItem Public Members>>+=

int anyNonSpecularBounces;

We will omit the fragment that initializes the VisibleSurface at the first intersection, as it is not especially interesting or different than the corresponding code in the CPU-based rendering path.

There are two new things to see in the fragment that samples the BSDF. First, the sample values used come from the RaySamples object rather than from Sampler method calls. Second, because the ConcreteBxDF type is known here at compile time, it is possible to call the templated BSDF::Sample_f() variant that takes the BxDF type and thus avoids dynamic dispatch, calling the appropriate BxDF method implementation directly.

<<Sample BSDF and enqueue indirect ray at intersection point>>=

Vector3f wi = bsdfSample->wi; SampledSpectrum beta = w.beta * bsdfSample->f * AbsDot(wi, ns)/bsdfSample->pdf; SampledSpectrum r_u = w.r_u, r_l; <<Update r_u based on BSDF sample PDF>>

if (bsdfSample->pdfIsProportional) r_l = r_u / bsdf.PDF<ConcreteBxDF>(wo, bsdfSample->wi); else r_l = r_u / bsdfSample->pdf;

<<Update etaScale accounting for BSDF scattering>>

Float etaScale = w.etaScale; if (bsdfSample->IsTransmission()) etaScale *= Sqr(bsdfSample->eta);

<<Apply Russian roulette to indirect ray based on weighted path throughput>>

if (beta) { <<Initialize spawned ray and enqueue for next ray depth>>

Ray ray = SpawnRay(w.pi, w.n, w.time, wi); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

}

<<MaterialEvalWorkItem Public Members>>+=

Vector3f wo;

The path throughput and rescaled path probabilities are updated in the same manner as in the VolPathIntegrator <<Update volumetric integrator path state after surface scattering>> fragment.

<<Compute updated path throughput and PDFs and enqueue indirect ray>>=

Vector3f wi = bsdfSample->wi; SampledSpectrum beta = w.beta * bsdfSample->f * AbsDot(wi, ns)/bsdfSample->pdf; SampledSpectrum r_u = w.r_u, r_l; <<Update r_u based on BSDF sample PDF>>

if (bsdfSample->pdfIsProportional) r_l = r_u / bsdf.PDF<ConcreteBxDF>(wo, bsdfSample->wi); else r_l = r_u / bsdfSample->pdf;

<<Update etaScale accounting for BSDF scattering>>

Float etaScale = w.etaScale; if (bsdfSample->IsTransmission()) etaScale *= Sqr(bsdfSample->eta);

<<Apply Russian roulette to indirect ray based on weighted path throughput>>

if (beta) { <<Initialize spawned ray and enqueue for next ray depth>>

Ray ray = SpawnRay(w.pi, w.n, w.time, wi); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

}

Only the unidirectional rescaled path probability needs to be passed in to this kernel, since the light path probability is initialized during its execution.

<<MaterialEvalWorkItem Public Members>>+=

SampledSpectrum beta, r_u;

The logic for updating the unidirectional rescaled path probability also follows that of the VolPathIntegrator, including how proportional BSDF samples such as those from the LayeredBxDF are handled. The call to the BSDF::PDF() method presents another opportunity to avoid dynamic dispatch given a compile-time known BxDF, however.

<<Update r_u based on BSDF sample PDF>>=

if (bsdfSample->pdfIsProportional) r_l = r_u / bsdf.PDF<ConcreteBxDF>(wo, bsdfSample->wi); else r_l = r_u / bsdfSample->pdf;

The etaScale factor used in Russian roulette is also updated in the same manner as before.

<<Update etaScale accounting for BSDF scattering>>=

Float etaScale = w.etaScale; if (bsdfSample->IsTransmission()) etaScale *= Sqr(bsdfSample->eta);

<<MaterialEvalWorkItem Public Members>>+=

Float etaScale;

We will omit the <<Apply Russian roulette to indirect ray based on weighted path throughput>> fragment, which is almost exactly the same as the VolPathIntegrator’s <<Possibly terminate volumetric path with Russian roulette>>, other than the fact that in this case the sample value for the Russian roulette test comes from RaySamples.indirect.rr.

For indirect rays, a ray is initialized and pushed on to the RayQueue for the next level of the ray tree. (We omit the implementation of the RayQueue::PushIndirectRay() method, which stores the provided values in the corresponding member variables.)

<<Initialize spawned ray and enqueue for next ray depth>>=

Ray ray = SpawnRay(w.pi, w.n, w.time, wi); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

The medium member of the Ray must be set manually based on which side of the surface the ray starts in.

<<Initialize ray medium if media are present>>=

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

The MediumInterface at the intersection point is thus needed in MaterialEvalWorkItem.

<<MaterialEvalWorkItem Public Members>>+=

MediumInterface mediumInterface;

Indirect rays require a few additions to the RayWorkItem class including the current path throughput , rescaled path probabilities, and the information necessary to compute MIS weights if the ray encounters emission.

<<RayWorkItem Public Members>>+=

SampledSpectrum beta, r_u, r_l; LightSampleContext prevIntrCtx; Float etaScale; int specularBounce; int anyNonSpecularBounces;

The direct lighting calculation follows a similar path as in the VolPathIntegrator.

<<Sample light and enqueue shadow ray at intersection point>>=

BxDFFlags flags = bsdf.Flags(); if (IsNonSpecular(flags)) { <<Choose a light source using the LightSampler>>

<<Sample light source and evaluate BSDF for direct lighting>>

<<Compute path throughput and path PDFs for light sample>>

<<Enqueue shadow ray with tentative radiance contribution>>

SampledSpectrum Ld = beta * ls->L; Ray ray = SpawnRayTo(w.pi, w.n, w.time, ls->pLight.pi, ls->pLight.n); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

shadowRayQueue->Push(ShadowRayWorkItem{ray, 1 - ShadowEpsilon, lambda, Ld, r_u, r_l, w.pixelIndex});

}

We will omit the <<Choose a light source using the LightSampler>> and <<Sample light source and evaluate BSDF for direct lighting>> fragments, as they are essentially the same as corresponding fragments that we have seen in earlier integrators. The first fragment gives us the sampledLight variable that stores a SampledLight and the second calls Light::SampleLi() with that light, giving an ls variable that stores the LightLiSample.

The path throughput and path sampling weights are computed in the same way as they are in the VolPathIntegrator::SampleLd() method, though here it is again possible to use the templated BSDF::PDF() method that avoids the dynamic dispatch overhead to compute the BSDF’s PDF.

<<Compute path throughput and path PDFs for light sample>>=

We go ahead and compute Ld, which is the final weighted contribution that the shadow ray would give to the pixel sample’s radiance estimate if it is unoccluded. If the ray is unoccluded and there are participating media in the scene, Ld may still be reduced due to the effect of extinction along the shadow ray; that factor is handled when the shadow ray is traced.

<<Enqueue shadow ray with tentative radiance contribution>>=

SampledSpectrum Ld = beta * ls->L; Ray ray = SpawnRayTo(w.pi, w.n, w.time, ls->pLight.pi, ls->pLight.n); <<Initialize ray medium if media are present>>

if (haveMedia) ray.medium = Dot(ray.d, w.n) > 0 ? w.mediumInterface.outside : w.mediumInterface.inside;

shadowRayQueue->Push(ShadowRayWorkItem{ray, 1 - ShadowEpsilon, lambda, Ld, r_u, r_l, w.pixelIndex});

15.3.10 Shadow Rays

The set of shadow rays to be traced after material evaluation and the direct lighting calculation in the EvaluateMaterialAndBSDF() method are stored in a ShadowRayQueue.

<<ShadowRayQueue Definition>>=

using ShadowRayQueue = WorkQueue<ShadowRayWorkItem>;

A single ShadowRayQueue is maintained by the WavefrontPathIntegrator.

<<WavefrontPathIntegrator Member Variables>>+=

ShadowRayQueue *shadowRayQueue = nullptr;

The work items in the queue store values that include the ray to be traced, its wavelengths, and tentative contribution Ld, as well as the rescaled path probabilities up to the ray’s origin.

<<ShadowRayWorkItem Definition>>=

struct ShadowRayWorkItem { Ray ray; Float tMax; SampledWavelengths lambda; SampledSpectrum Ld, r_u, r_l; int pixelIndex; };

The WavefrontAggregate interface includes two methods for tracing shadow rays: IntersectShadow(), which is called for scenes that have no participating media where only a binary visibility test is needed, and IntersectShadowTr() for scenes that do have media and where transmittance must be computed. A call to the WavefrontPathIntegrator::TraceShadowRays() method leads to a call of the appropriate method; the shadow ray queue is reset immediately afterward.

<<WavefrontAggregate Interface>>+=

virtual void IntersectShadow(int maxRays, ShadowRayQueue *shadowRayQueue, SOA<PixelSampleState> *pixelSampleState) const = 0; virtual void IntersectShadowTr(int maxRays, ShadowRayQueue *shadowRayQueue, SOA<PixelSampleState> *pixelSampleState) const = 0;

The CPU implementation of IntersectShadow() processes items from the queue in parallel and calls its aggregate’s IntersectP() method to determine if each ray is occluded. A call to RecordShadowRayResult(), another utility function shared by both CPU and GPU aggregates that is defined in wavefront/intersect.h, takes care of the additional work to be done after the intersection test has been resolved.

<<CPUAggregate Method Definitions>>=

void CPUAggregate::IntersectShadow( int maxRays, ShadowRayQueue *shadowRayQueue, SOA<PixelSampleState> *pixelSampleState) const { <<Intersect shadow rays from shadowRayQueue in parallel>>

ParallelFor(0, shadowRayQueue->Size(), [=] (int index) { const ShadowRayWorkItem w = (*shadowRayQueue)[index]; bool hit = aggregate.IntersectP(w.ray, w.tMax); RecordShadowRayResult(w, pixelSampleState, hit); });

}

<<Intersect shadow rays from shadowRayQueue in parallel>>=

If the ray was occluded, then no further work is necessary. Otherwise, the final MIS-weighted radiance is found by dividing by the two rescaled path probabilities and is then added to the running sum of the radiance estimate in the PixelSampleState for the ray.

<<Wavefront Ray Intersection Enqueuing Functions>>+=

void RecordShadowRayResult(const ShadowRayWorkItem w, SOA<PixelSampleState> *pixelSampleState, bool foundIntersection) { if (foundIntersection) return; SampledSpectrum Ld = w.Ld / (w.r_u + w.r_l).Average(); SampledSpectrum Lpixel = pixelSampleState->L[w.pixelIndex]; pixelSampleState->L[w.pixelIndex] = Lpixel + Ld; }

In the presence of participating media, there is more work to do for shadow rays. The computation proceeds just as in the latter two thirds of the VolPathIntegrator::SampleLd() method: the ray is successively intersected against the scene geometry, terminating if it hits an opaque surface. Otherwise a call to SampleT_maj() samples the medium, so that transmittance can be estimated using ratio tracking just as in <<Update transmittance for current ray segment>> in that method. Given the close similarities, we will not include that code for the wavefront integrator here.

15.3.11 Updating the Film

Each pixel sample’s value is provided to the Film only after all of them are finished. One might wonder: why not push samples on to a queue whenever their ray paths terminate and add film samples sooner by periodically running a kernel that processes the items on the queue? There are two costs to such an approach: the first is that samples would be added to the film in an arbitrary order, so the accesses to values like PixelSampleState::L would be irregular across threads in thread groups, which could harm performance.

The second cost is the unnecessary bandwidth that would be used to push work onto the queue and to process it. We know that every pixel sample taken at the start of tracing each path will eventually be added to the film, so there is no need to separately enumerate them. For both of these reasons, it is more efficient to use a plain GPUParallelFor() loop and process the PixelSampleState structures in order.

<<WavefrontPathIntegrator Film Methods>>=

void WavefrontPathIntegrator::UpdateFilm() { ParallelFor("Update film", maxQueueSize, PBRT_CPU_GPU_LAMBDA (int pixelIndex) { <<Check pixel against film bounds>>

Point2i pPixel = pixelSampleState.pPixel[pixelIndex]; if (!InsideExclusive(pPixel, film.PixelBounds())) return;

<<Compute final weighted radiance value>>

SampledSpectrum Lw = SampledSpectrum(pixelSampleState.L[pixelIndex]) * pixelSampleState.cameraRayWeight[pixelIndex];

<<Provide sample radiance value to film>>

SampledWavelengths lambda = pixelSampleState.lambda[pixelIndex]; Float filterWeight = pixelSampleState.filterWeight[pixelIndex]; if (initializeVisibleSurface) { <<Call Film::AddSample() with VisibleSurface for pixel sample>>

VisibleSurface visibleSurface = pixelSampleState.visibleSurface[pixelIndex]; film.AddSample(pPixel, Lw, lambda, &visibleSurface, filterWeight);

} else film.AddSample(pPixel, Lw, lambda, nullptr, filterWeight);

}); }

Recall that the GenerateCameraRays() method sets the PixelSampleState::pPixel coordinates for all the threads, but that it then exits immediately for threads corresponding to extra scanlines beyond the end of the image. Therefore, we must perform the same check of the pixel coordinates against the film bounds here, and not do any further processing for such pixels.

<<Check pixel against film bounds>>=

Point2i pPixel = pixelSampleState.pPixel[pixelIndex]; if (!InsideExclusive(pPixel, film.PixelBounds())) return;

The final radiance value is scaled by the camera ray weight returned by the Camera before it is provided to the Film.

<<Compute final weighted radiance value>>=

SampledSpectrum Lw = SampledSpectrum(pixelSampleState.L[pixelIndex]) * pixelSampleState.cameraRayWeight[pixelIndex];

A few more values read from pixelSampleState and we have everything that we need to call the Film’s AddSample() method. The implementation uses the initializeVisibleSurface member variable, set in the constructor, to distinguish between Film implementations that make use of the VisibleSurface and those that do not in order to save the memory bandwidth of reading it if it will not be used.

<<Provide sample radiance value to film>>=

VisibleSurface visibleSurface = pixelSampleState.visibleSurface[pixelIndex]; film.AddSample(pPixel, Lw, lambda, &visibleSurface, filterWeight);

} else film.AddSample(pPixel, Lw, lambda, nullptr, filterWeight);

One interesting detail is that the VisibleSurface must be loaded from SOA format into the regular VisibleSurface structure layout so that a pointer to it can be passed to the AddSample() method; a pointer to its current in-memory representation does not correspond to the layout that this method expects.

<<Call Film::AddSample() with VisibleSurface for pixel sample>>=

VisibleSurface visibleSurface = pixelSampleState.visibleSurface[pixelIndex]; film.AddSample(pPixel, Lw, lambda, &visibleSurface, filterWeight);