15.3 Path Tracer Implementation
With these utility classes in hand, we can turn to the WavefrontPathIntegrator class implementation. As mentioned earlier, its functionality matches that of the VolPathIntegrator, restructured to run with a wavefront architecture.
The WavefrontPathIntegrator class declaration is in the file wavefront/integrator.h and some of its implementation is in wavefront/integrator.cpp, though a number of its method implementations are distributed across separate source files in the wavefront/ directory. While this is a different organization than we have used elsewhere in pbrt (where all the non-inline method definitions for a class defined in a file named file.h are in file.cpp), distributing them in this way reduces the time necessary to compile pbrt, since many of the methods make use of C++ features that can lead to long compile times. Spreading them out across multiple files allows multiple CPU cores to compile their implementations in parallel.
The constructor converts the provided BasicScene into the objects that represent the scene for rendering. We will skip over the majority of its implementation here, however, as most of it just calls all the appropriate object Create() methods to allocate and initialize the corresponding objects (see Section C.3). All allocations are performed with Allocators that use the provided memory resource. When rendering on the GPU, this leads to the use of managed memory.
These are some of the key scene objects that are initialized in the constructor. As with the CPU-based Integrators, the infinite lights are stored independently so that they can be efficiently iterated over for rays that escape the scene.
There are a few additional member variables to store what should now be familiar configuration options.
In order to limit memory use, this integrator places a limit on the number of pixel samples that it works on at once. All told, each active pixel sample requires roughly 1,000 bytes of additional storage for its state variables and to ensure that all the queues have sufficient space for the work to be done for the ray. (The actual amount of storage varies based on the scene, as the constructor is careful not to allocate work queues for the volume scattering kernels if there is no participating media in the scene, for example.)
The following fragment from the WavefrontPathIntegrator constructor therefore sets a maximum number of active samples and then determines how many scanlines of pixels that corresponds to. That value in turn determines how many passes (of rendering that many scanlines) are necessary to cover the full image resolution. Finally, scanlinesPerPass is set with a new value that evens out the number of scanlines rendered in each pass. This can help with load balancing by avoiding having a small number of scanlines in the final pass.
All the queues that are allocated to buffer work between kernels are also allocated to be able to store maxQueueSize individual work items. Thus, all queues are able to store one work item for each of the active pixel samples. This is a sufficient number, as none of the kernels in the current implementation ever push more than one item on a queue for each item processed. However, it may waste a substantial amount of memory. For example, there is a work queue for rays that hit emissive surfaces in the scene. It is rare that all the rays will hit emissive objects, yet there must be storage for all in case of that eventuality. The alternative, dynamically increasing the size of the queues when necessary, would be difficult to implement efficiently in the context of the massive parallelism on GPUs.
The WavefrontPathIntegrator does not use only queues to provide values to kernels and to communicate results. While the queue model is elegant, it can be inefficient if some values are computed early and not used until much later. For example, consider the VisibleSurface structure that is provided to Film implementations like the GBufferFilm: it is initialized at the first intersection after the camera, but then its value is not used again until the full ray path has been traced and the Film::AddSample() method is called at the end. VisibleSurface is, further, a relatively large structure. In a purely queue-based model, a substantial amount of memory bandwidth would be consumed passing it along from kernel to kernel until the end.
Therefore, the PixelSampleState structure is used for storing all such values. Various member variables will be added to it in what follows.
The WavefrontPathIntegrator maintains an SOA-arranged array of PixelSampleStates, allocated to have maxQueueSize entries. Each sample’s index into this array is carried through the rendering computation so that it is easy to determine which entry corresponds to a pixel sample being processed.
15.3.1 Work Launch
Because the WavefrontPathIntegrator may be running either on the CPU or on the GPU, it provides methods for launching work that use the appropriate processor. Each selects the appropriate type of processor based on the renderer’s configuration.
The first, ParallelFor(), selects between the types of two parallel for loops.
The Do() method executes the provided function in a single thread. On the CPU, it is no different than a regular function call; on the GPU, however, it executes the provided function using GPUParallelFor() with a single-item loop. For reasons that should be clear by now, this not a good way to do a meaningful amount of work on the GPU, but this capability is necessary for resetting counters, clearing queues, and the like.
15.3.2 The Render() Method
Rendering is initiated by a call to Render(). (Recall Figure 15.2 in Section 15.1.3, which summarizes the kernels that this method launches.) Similar to earlier integrators, it progressively takes more samples in all pixels until the requested number of samples have been taken. This method tracks how long rendering takes and returns the number of elapsed seconds; we have not included here the straightforward few lines of code that handle that.
One subtlety is that the initialization of the pixelBounds variable at the start is important for rendering performance. It will be necessary to have this value later on in the implementation of Render()—though because the Film is stored in managed memory, calling the PixelBounds() method after GPU kernels have been launched could incur the overhead of copying data back to the CPU.
By default, the number of samples taken in each pixel is the number determined by the Sampler; the value returned by its SamplesPerPixel() method is cached in a member variable for the same reason as pixelBounds is cached above. It is possible to limit rendering to a single specified sample index using the –debugstart command line option; the code to set firstSampleIndex and lastSampleIndex in that case is not included here.
Given a sample index, the next step is to loop over the one or more chunks of scanlines and to evaluate a sample in each of their pixels. This code is also similar to the corresponding code in the VolPathIntegrator: evaluating each sample starts with generating a camera ray and then following it through multiple intersections until it is terminated or the maximum depth is reached. The key difference is that, here, these tasks are being performed for as many as a million or so pixel samples at a time.
Before the rays are generated, it is necessary to reset the work queue that will store them. We will need to maintain more than one ray queue: one for the set of rays currently being traced and another for the indirect rays that have been spawned to be traced at the next depth. The CurrentRayQueue() method, defined shortly, returns the queue for the specified depth.
When the GPU is being used for rendering, it is critically important that the Reset() method is called from the GPU and not the CPU; Do() is thus used here. Not only could resetting it from the CPU be inefficient, as doing so would involve the CPU writing to managed memory, but—given the asynchronous execution of the GPU—it would almost certainly be incorrect, potentially resetting a queue that was still in use by the code that was executing on the GPU.
The WavefrontPathIntegrator maintains a pair of ray queues, rather than allocating one for each ray depth up to the maximum. It manages them using double buffering: one stores the current active set of rays and should only be read from, while the other stores the rays enqueued to trace at the next depth and should only be written to. At each successive ray depth, the roles of the two queues are switched.
Two convenience methods return pointers to the RayQueues given a depth. The double buffering logic is effectively encapsulated in their implementations.
15.3.3 Generating Camera Rays
The two methods related to generating camera rays are implemented in the file wavefront/camera.cpp, not in wavefront/integrator.cpp, because we have used C++ templates to generate multiple specialized instances of the methods, each one specialized based on some of the object types involved. As mentioned earlier, techniques like this can cause lengthy compile times, so it is worthwhile to isolate the camera method implementations in their own source file.
There are two GenerateCameraRays() methods. The first, which is called by WavefrontPathIntegrator::Render(), determines the concrete type of the Sampler being used and then calls the second GenerateCameraRays() method, which is templated on the type of the Sampler. This idea is something that we will see again in other methods: CPU code making a runtime determination of the types of objects involved in a computation, which allows the execution of code that is specialized for those types. This is especially beneficial to performance when running on the GPU.
The first method defines a lambda function, generateRays, and then invokes the TaggedPointer’s dynamic dispatch mechanism, which will end up calling the lambda function using the concrete type of the Sampler. (In this case, we use the DispatchCPU() variant, which must be used for code that can only execute on the CPU.)
Using auto in the parameter declaration for the following lambda function causes it to be parameterized by the type of sampler, like a template function. (C++20 provides a less obscure syntax for templated lambda functions, though this version of pbrt limits itself to C++17.) Thus, the type of sampler will be a concrete instance of one of the sampler types provided in the Sampler declaration in Section 8.3.
There is a nit in that sampler is passed as a reference to a pointer to the specific Sampler type; a bit of work in the using declaration gives us the actual Sampler type. A second nit is that we must filter out the MLTSampler, which is only used by the MLTIntegrator, and DebugMLTSampler, a variant of it that is used for debugging. Those classes use vector methods like push_back() that are not supported in GPU code, and therefore we must make clear to the compiler what we know in any case: those samplers will not be used here.
With the concrete sampler type in hand, the second GenerateCameraRays() method can be called, with the sampler type provided for the template specialization.
We can now move on to the second GenerateCameraRays() method, which calls WavefrontPathIntegrator::ParallelFor() to generate all of the rays in parallel.
The sequence of operations performed for each camera ray again generally matches what we have seen before: after computing the pixel coordinates for the provided loop index, samples are generated, a set of wavelengths are sampled, and then the Camera generates the ray.
Given the pixel bounds of the film, the pixelIndex value is mapped to pixel coordinates, starting from the lower bound given by the pixel bounds and y0 value, respectively. Threads are then assigned to pixels in scanline order. Because the PBRT_GPU_LAMBDA macro includes *this in the lambda capture, various WavefrontPathIntegrator member variables such as film, which is used here, can be directly accessed in the lambda function.
The pixel coordinates are then stored in PixelSampleState. Note that if we are just setting a single member variable of an SOA structure rather than assigning an entire structure value, then the member variable must be indexed and not the pixelSampleState structure itself; our automatically generated SOA classes are not able to provide the same syntax as would be used for an array of structures layout in that case.
The pixel coordinates corresponding to a sample are our first addition to the PixelSampleState structure.
If the image has been split into multiple spans of scanlines, then the number of parallel loop iterations in the final pass may be a few more than there are pixels left to be sampled. Rather than worry about this when specifying the number of threads to launch in the call to WavefrontPathIntegrator::ParallelFor(), we just check this condition in the kernel. Indices for out of bounds pixels return after initializing PixelSampleState::pPixel and do not enqueue any rays.
Next, samples are needed to generate the camera ray. Here we can see the value of specializing this method based on the ConcreteSampler type. The following fragment is perhaps best understood by reading it with the ConcreteSampler in the following replaced with, say, HaltonSampler in your head. pixelSampler then is a stack-allocated HaltonSampler, and the TaggedPointer::Cast() call gives us a pointer of that type; dereferencing it lets each thread initialize its own sampler from the exemplar of the one that the WavefrontPathIntegrator stores. All threads access the same memory locations, so reading the Sampler from memory does not consume much bandwidth.
The benefit from this approach comes from having the sampler allocated on the stack. On the GPU, its member variables can be stored directly in registers, giving high performance when executing its methods. As an additional bonus, there is no overhead for dynamic dispatch in the StartPixelSample() call: the sampler’s type is known, so the appropriate method can be called directly. It will usually be expanded inline at the call site.
Wavelengths are sampled in precisely the same way as they are in most of the other integrators.
The CameraSample can be initialized in the usual way and then it is on to call Camera::GenerateRay(). In this case, this method call is also resolved using pbrt’s usual dynamic dispatch system, just as it is in all the other integrators. An alternative implementation approach would be to further specialize the GenerateCameraRays() method based on the type of the Camera being used; after all, it is the same for all pixel samples and so it is somewhat wasteful for all threads to perform the same computations for dynamic dispatch. We have found that in practice that alternative gives a negligible performance benefit, and so our implementation here remains based on dynamic dispatch.
One other thing to note is that the wavefront integrator path generates regular Rays here and not RayDifferentials. Approximate differentials for filtering will be computed later, trading off superior antialiasing quality in exchange for higher performance from reduced memory bandwidth due to having less information associated with each ray. An exercise at the end of the chapter revisits this choice.
A few additional values that are stored in PixelSampleState can now be set. We avoid initializing the heavyweight VisibleSurface member if it is not going to be used by the Film; the initializeVisibleSurface member variable is set in the WavefrontPathIntegrator constructor to record whether it is needed. Saving this memory bandwidth when possible is worth this easy check.
Of these member variables, we will see L most often in what follows. It stores the accumulated radiance estimate for the ray path. After the path terminates, its value will be provided to the Film.
If the Camera successfully generated a ray, it is pushed into the RayQueue along with its wavelengths and associated pixel index, which allows subsequent kernels to be able to index into WavefrontPathIntegrator::pixelSampleState to retrieve values there that are associated with this ray.
Some of the work queues provide specialized methods for pushing work on to them; RayQueue is one of them. It provides a PushCameraRay() method for camera rays and PushIndirectRay() for scattered rays that sample indirect lighting. Doing so not only makes the code where work is pushed on to the queue slightly cleaner, but it also makes it possible to set some of the forthcoming RayWorkItem member variables to default values for camera rays without needing to specify default values here. (For example, RayWorkItem carries path sampling PDFs that are initialized to 1 for camera rays.) We will pass over the implementation of this method here, as it is all setting member variables, either from passed-in values or from defaults.
We will start the definition of the RayWorkItem structure here.
As would be expected from the parameters passed to PushCameraRay(), RayWorkItem stores a ray, its wavelengths, and an index into the WavefrontPathIntegrator’s pixelSampleState array from which additional data associated with the ray can be found.
The reader may have noted that lambda is both pushed to the ray queue and stored in the PixelSampleState, which is admittedly redundant, though there are good reasons for both. Its value must be stored in the PixelSampleState structure, as it is required for updating the film, which is not scheduled using work queues but via a loop over the PixelSampleState values; see further discussion of this design in Section 15.3.11.
However, if other kernels along the way accessed lambda via PixelSampleState, performance would be poor, since the initial correlation between the value of pixelIndex and threads in a thread group quickly becomes shuffled up over the course of rendering. Thus, reads from PixelSampleState would be highly incoherent. Because it is used in just about every kernel, lambda is also passed along throughout the queues of the wavefront integrator. Since it is in work queues, the loads and stores to read and save its value are coherent across the thread group, giving good performance.
15.3.4 Loop over Ray Depths
With the camera having seeded the ray queue with an initial set of rays, we can now turn to the loop that executes once for each successive set of ray intersection tests and shading calculations up until the maximum ray path length. One subtlety is that this loop is over “wavefront depth,” which is different than the ray depth that has been tracked in other integrators. Here, the wavefront depth reflects the number of times that this loop has executed, tracing a batch of rays and processing their intersections. Each ray tracks its own depth, which is usually the same as the wavefront depth, though the depth of rays that hit invisible surfaces that delineate the boundaries between volumetric media is not incremented at those intersections (as in other integrators).
This loop always runs on the CPU; when the GPU is used for rendering, it is responsible for launching the appropriate kernels to perform the rendering computation. An implication of this design and the decoupling of the CPU and GPU is that the CPU has no visibility into the state of the ray-tracing calculations on the GPU. For example, if all rays terminate after the first intersection, the CPU will continue submitting kernel launches to execute the ray-tracing pipeline even though no work is passing through it. All the kernels would exit immediately, though there would be a cost from launching all of them and their determining that there is no work to be done. For many scenes this is not a problem, but it can harm this integrator’s performance with high maximum ray depths. An exercise at the end of the chapter outlines some design alternatives.
All the work queues except for the RayQueue holding the current set of rays must be cleared at the start of each iteration. The following fragment clears the RayQueue for the next set of indirect rays and adds a fragment for the rest of the queues. We will add additional Reset() calls to this fragment for the various other queues along with the code that defines and uses them.
The closest intersection with a surface, if any, is then found for each ray.
Before surface scattering or emission is considered, the medium (if any) is sampled. If the medium scatters or absorbs the ray, then any surface intersection will be ignored.
Only after medium sampling are rays that left the scene taken care of. In this way, rays passing through participating media that do not interact with it and then leave the scene can be processed at the same time as rays that left the scene but did not pass through media.
The contribution of emissive surfaces to a sample’s radiance value is also only added after medium sampling, since it should not be included if the medium scattered or absorbed the ray.
The loop over wavefront depth can only be terminated after emissive surfaces have been accounted for, since the MIS-based direct lighting calculation accounts for their contribution.
Next, the shadow rays enqueued by the material evaluation kernel are traced; the radiance contributions of the unoccluded ones are accumulated in PixelSampleState.
The following sections go into these steps in more detail.
15.3.5 Sample Generation
The first step in each loop iteration is to generate values for all the sample dimensions that may be required for sampling lighting if a scattering event is found along the ray, either in a participating medium or on a surface. Generating all of these samples ahead of time in a separate kernel allows for specializing that kernel based on the sampler type, giving the same performance benefits as were found in GenerateCameraRays().
The GenerateRaySamples() method is defined in the file wavefront/samples.cpp. Similar to camera rays, there is an initial GenerateRaySamples() method called in the main rendering loop that determines the actual Sampler type and invokes the appropriate specialization. Because this dispatch method is nearly the same as the analog in GenerateCameraRays(), we omit it here and turn directly to the specialized method.
Unlike the CPU-only integrators, where sample dimensions are allocated implicitly based on the runtime sequence of Get1D() and Get2D() method calls, here dimensions are explicitly allocated, ensuring that unique dimensions are assigned to different uses of samples. Doing so imposes a coupling between this kernel and the use of samples in following ones and also means that samples may be generated that are not actually used (e.g., if a perfect specular surface is intersected). These costs are worth paying in return for the performance benefits of generating samples in a specialized kernel.
The first task is to find the first dimension to allocate for this ray’s samples. The first 5 sample dimensions are used to generate the camera ray, and then 1 is used to sample the wavelengths. At least 7 samples are needed for each ray depth: 3 for the call to BSDF::Sample_f(), 1 to sample a light source and then 2 to sample a position on it, and then 1 more for Russian roulette for the indirect ray. Two additional dimensions are consumed for sampling participating media. (The haveMedia WavefrontPathIntegrator member variable is set in its constructor based on the scene.)
This kernel uses the same trick to get a stack-allocated Sampler as was done for camera rays. In this case, figuring out which pixel the ray is associated with requires a read from the PixelSampleState.
The RaySamples structure bundles up the samples needed for a ray. A series of Get1D() and Get2D() calls initializes its member variables. We omit the fragment that initializes the indirect as it is more of the same. We have carefully ordered the sample generation method calls below to match their use in the VolPathIntegrator so that the two integrators use the same sample values for their sampling tasks at each pixel sample.
The three sample dimensions for sampling the light source are available in the direct substructure.
Similarly, the dimensions for BSDF sampling and Russian roulette are available in indirect.
haveMedia indicates whether medium samples have been stored, which makes it possible to save bandwidth when they are unset.
Sample values are squirreled away in PixelSampleState rather than being passed along via work queues. This does mean that both storing sample values here and reading them in subsequent kernels is not done with coherent memory accesses, since a thread group’s pixelIndex values will not necessarily be contiguous after the initial camera rays. However, because they are not used in many kernels, passing them along through work queues would entail multiple instances of reading them from one queue just to write them to another, which would be a waste of bandwidth. We have found that the current approach gives marginally better performance in practice.
One shortcoming of the approach implemented in this section is that samples are still generated for rays that do not intersect anything. A more optimized implementation might try to defer sample generation until the specific samples required were known, though the benefits are likely to be marginal: on the GPU, if the samples needed vary over the rays in a thread group, then—given the GPU’s thread group execution model—there may be no savings from skipping the work for some threads if it is still needed by others. Further, sample generation is normally just a few percent of overall rendering time, and so anything more sophisticated is not worth bothering with, at least for pbrt’s requirements.
15.3.6 Intersection Testing
Ray intersections are handled differently depending on whether the wavefront integrator is running on the CPU or the GPU. On the CPU, the ray queues are consumed using ParallelFor() calls and pbrt’s regular acceleration structures from Chapter 7 are used to find intersections. On the GPU, platform-specific functionality is used to do so. In order to abstract the differences between these approaches (and to make it easier to add support for additional GPU architectures), ray intersection work done by the WavefrontPathIntegrator is handled by an implementation of the WavefrontAggregate class, which defines an interface that reflects the integrator’s needs.
The WavefrontPathIntegrator stores a WavefrontAggregate in its aggregate member variable. The CPU implementation, CPUAggregate, is found in the source files [EntityList: wavefront/cmd: -: aggregate.h] and wavefront/aggregate.cpp. The implementation for NVIDIA GPUs, OptiXAggregate, is found in gpu/aggregate.h and gpu/aggregate.cpp.
All WavefrontAggregates must provide a method that returns the bounds of the entire scene.
IntersectClosest() traces a set of rays and finds their closest surface intersections. Beyond the queue that provides the rays to be traced, a number of additional queues must be provided to it. Further work for a ray may be added to multiple queues depending on the specifics of its surface intersection.
The WavefrontPathIntegrator’s call to IntersectClosest() mostly passes the corresponding queues from its member variables, though calls to GetCurrentQueue() and NextRayQueue() are necessary to get the appropriate instances of those queues.
In order to discuss the responsibilities of IntersectClosest() implementations, we will focus on its implementation in CPUAggregate. Rather than once again repeating the unwieldy list of arguments here, we will proceed directly to the method implementation, which starts with a parallel for loop over the items in the queue.
A regular aggregate stored in CPUAggregate handles the ray intersection test, with different cases afterward for rays that have an intersection with a surface and rays that do not.
The details of enqueuing further work for rays that have no intersections are handled by the EnqueueWorkAfterMiss() function, which is defined in the file wavefront/intersect.h. That header file provides a number of functions that are used by both the CPU- and GPU-based ray intersection code. Gathering them there allows a single implementation to be used by both, which in turn reduces the complexity of those WavefrontAggregate implementations.
Unlike many of the other functions in wavefront/intersect.h, EnqueueWorkAfterMiss() is simple enough that it barely merits its own function—if the ray is passing through participating media, it is enqueued for medium sampling, and otherwise it is enqueued for evaluating infinite light sources’ contribution to its radiance, if there are any in the scene.
For rays that do intersect a surface, there is more to be done. The EnqueueWorkAfterIntersection() function, which is not included in the text, handles all the following details.
If the ray is passing through participating media, it is enqueued for medium sampling. Only if the ray is not absorbed or scattered in the medium sampling kernel is work then queued for the surface intersection to be processed.
If a ray with an associated intersection is not scattered or absorbed by participating media and hits an emissive surface, it is added to a queue so that the surface’s scattered radiance will be added to the ray’s radiance estimate. Rays hitting surfaces are also sorted by the surfaces’ materials and the complexity of their textures into basicEvalMaterialQueue or universalEvalMaterialQueue, both of which are MultiWorkQueues. (Section 15.3.9 describes how the material queues are organized.) Finally, rays that hit surfaces with no materials that represent medium transitions are pushed on to the nextRayQueue to be continued in the next iteration, on the other side of the surface intersection with their medium member variable updated accordingly.
GPU Ray Intersections
Following our custom of not including platform-specific code in the book text, we will not discuss the details of pbrt’s use of CUDA and the OptiX API for its GPU ray-tracing implementation, but we will summarize the abstractions currently used for GPU ray tracing. See the files gpu/optix.h and gpu/optix.cu for details, however.
Current GPU ray intersection APIs follow a different model than the CPU-focused accelerators in Chapter 7. Those accelerators provide a fully procedural model, where the user calls functions that take a single ray and execute synchronously before returning their results. GPU ray tracing is based on a programmable pipeline, where some functionality is provided by the GPU vendor (either in hardware or in software), and some is provided by the user in the form of code that is executed at particular points in the pipeline.
The user-supplied code is in the form of a series of shaders, each of which is a function that is invoked in specific cases. In practice, many instances of these shaders run concurrently, following the GPU’s thread group execution model. Ray tracing starts with a ray generation shader that is responsible for generating a ray and submitting it to the GPU ray-tracing function. In pbrt’s implementation, the ray generation shader retrieves the ray from the RayQueue.
GPUs currently only have native support for intersecting rays with triangles. If a scene has no other types of shapes, then the intersection tests are handled entirely by the GPU’s ray-tracing implementation. For other types of shapes, custom intersection shaders can be provided by the user. The user specifies a shape’s axis-aligned bounding box and associates an intersection shader with it. When a ray intersects that box, the intersection shader is invoked to determine if there is an intersection. If there is, both the parametric value along the ray and a handful of additional user-defined values can be returned to associate with the intersection.
pbrt uses custom intersection shaders for all the quadric shapes as well as for the BilinearPatch. All follow the same pattern. For example, for bilinear patches, the intersection shader calls the previously defined IntersectBilinearPatch() function. Recall that in the event of an intersection, it returns a BilinearIntersection, not a full SurfaceInteraction. This design is intentional, as a BilinearIntersection is much smaller—just 3 Floats. Returning a small representation of the intersection is beneficial for performance, as it reduces how much information must be written to memory.
Alpha testing adds an additional complication to intersection testing (recall the discussion of alpha textures in Section 7.1.1). An any hit shader is applied to any primitives with alpha textures, such as leaves with alpha masks. The any hit shader executes for all intersections with such primitives, before it is known if an intersection is the closest. Our implementation uses the any hit shader to evaluate the alpha texture and apply a stochastic test, just like the <<Possibly ignore intersection based on stochastic alpha test>> fragment in the GeometricPrimitive class used by the CPU. If the test fails, then the GPU is instructed to ignore the intersection completely.
Once ray intersection testing has been completed, one of two shaders is invoked. For rays that have no intersections, a miss shader is called. Otherwise, a closest hit shader is invoked for processing at the intersection point. Now a full SurfaceInteraction is needed. All the shapes that are handled with custom intersection shaders have a method that converts their compact intersection representation to a SurfaceInteraction; for example, it is BilinearPatch::InteractionFromIntersection() for bilinear patches. The GPU reports the barycentric coordinates of triangle intersections, which are sufficient for the Triangle::InteractionFromIntersection() method to do its work.
Given final intersections (or the determination that a ray does not intersect anything), additional work is enqueued using functions from wavefront/intersection.h, as is the case for the wavefront CPU ray-tracing aggregate.
15.3.7 Participating Media
In the interest of space, we will not walk through the code for the kernels launched by the medium sampling method, SampleMediumInteraction(). Algorithmically, it matches the corresponding code in VolPathIntegrator, so we will just summarize the queues and types of work involved.
For any ray with a non-nullptr Medium, the ray intersection code enqueues a MediumSampleWorkItem in the mediumSampleQueue. A first medium-related kernel processes all the entries on this queue. Its task is to call the ray medium’s SampleT_maj() method, adding emission at each sampled point before sampling one of absorption, real scattering, or null scattering. Absorption causes path termination; real scattering causes work to be added to another queue, mediumScatterQueue, that holds MediumScatterWorkItems; and null scattering causes medium sampling to continue. The path throughput and path sampling PDFs are updated along the way. In the end, if the ray is neither scattered nor absorbed, then work is added to queues in the same manner as for rays that are not passing through media in the closest hit shader.
A second kernel is then launched to process all the medium scattering events in the mediumScatterQueue. The usual sampling process ensues: a light and then a point on it are sampled, path sampling PDFs are computed, and a shadow ray is enqueued. Next, the phase function is sampled to generate an indirect ray direction. Work for that ray is then added to the next wavefront depth’s RayQueue via a call to PushIndirectRay().
15.3.8 Ray-Found Emission
Two kernels handle rays that may add emission to their radiance estimates due to their interaction with emissive entities. The first processes rays that have left the scene and the second handles rays that intersect emissive surfaces.
escapedRayQueue is only allocated if the scene has one or more infinite area lights; there is otherwise no work to be done for such rays.
Note that HandleEscapedRays() returns immediately if there is no queue and thus no infinite area lights, saving the cost of an unnecessary kernel launch in that case.
The work item for escaped rays stores the ray origin, direction, and depth as well as its wavelengths. The ray’s associated pixel index makes it possible to add any found emission to its radiance estimate.
The kernel’s implementation parallels the <<Accumulate contributions from infinite light sources>> fragment in the VolPathIntegrator, just using the information about the ray from the EscapedRayWorkItem.
The final result is then added to the ray’s associated PixelSampleState::L value, so long as L is nonzero. If it is zero, skipping the unnecessary update may not lead to fewer instructions being executed, given the GPU’s execution model, but it will save memory bandwidth, which can be just as important to performance.
For infinite area lights that are directly visible or are encountered through specular reflection, MIS is performed using only the unidirectional path PDF, since that is the only way the path could have been sampled.
The path throughput beta is needed to weight the light’s contribution. Further, whether or not the previous bounce was due to specular reflection must be tracked. Note that this value only requires a single bit; using a full 32-bit int is wasteful. A more optimized implementation might save some bandwidth by stealing one of the bits from pixelIndex, which does not need all 32 of them.
If other types of scattering preceded the ray’s escape, MIS weights are computed using both the light and unidirectional sampling PDFs, following the same approach as is implemented in the <<Add infinite light contribution using both PDFs with MIS>> fragment in the VolPathIntegrator.
In order to compute MIS weights, the EscapedRayWorkItem must provide not only rescaled path probabilities but also geometric information about the previous path scattering vertex.
The second kernel handles rays that intersect emissive surfaces.
The HitAreaLightQueue stores HitAreaLightWorkItems.
The first step in the kernel is to compute the emitted radiance at the ray’s intersection point. If there is none, the kernel can return immediately. Note that it thus could be beneficial if a Light method was added that did a quick conservative test for this case. Such a method could make it possible not to pay the cost of enqueuing work for cases such as a one-sided light source that was intersected on its non-emissive side. With the current implementation, that case is detected only here, costing both bandwidth for the queue work and execution divergence from the threads that return. (Such a method should be simple, deferring more complex tasks like performing texture lookups for surfaces that use image maps to modulate their emission, in order not to harm performance.)
We also note that the call to Light::L() would be a potential source of execution divergence if pbrt had more than one Light implementation that could be used for emissive surfaces. Currently, there is only DiffuseAreaLight, so this is not a concern, but if there were more, it might be worthwhile to have a separate queue for each type of area light in order to avoid this divergence.
The following HitAreaLightWorkItem member variables provide the information necessary to compute the emitted radiance from the intersection point back along the ray.
The final path contribution is found similarly to how it is for escaped rays and infinite area lights.
Once again, the specularBounce member could be packed in elsewhere in order to save storage and reduce bandwidth requirements.
We will not include the <<Compute MIS-weighted radiance contribution from area light>> fragment here, which closely parallels the corresponding case in the VolPathIntegrator as well as the earlier fragment <<Compute MIS-weighted radiance contribution from infinite light>>.
The final weighted radiance value is then accumulated in the ray’s PixelSampleState.
The queues for both of these kernels need to be reset at the start of each ray depth iteration, so we will add the corresponding Reset() calls to the queue-resetting fragment defined earlier.
Having seen these two kernels, it is fair to ask: why not handle these cases immediately in the ray intersection and medium sampling kernels, rather than incurring the bandwidth costs of queuing up work there and then consuming it here? This is yet another trade-off of bandwidth versus execution convergence. Doing this work in separate kernels for only the cases where it is required means that the kernels both start execution fully converged, with all threads doing useful work. In the intersection and medium scattering kernels, we would generally expect that only a subset of the rays would leave the scene or intersect emissive surfaces. In that case, we would have control divergence and all rays in the thread group that did not intersect an emissive surface would incur a performance cost if even one of the others did. The optimal trade-off depends on both the complexity of the computation to be done in those cases and the amount of bandwidth offered by the GPU.
15.3.9 Surface Scattering
With the WavefrontPathIntegrator, the majority of rendering time is usually spent in the kernels responsible for surface scattering. These kernels are specialized by the surfaces’ materials and, in turn, the type of BxDF that each material uses for the BSDF it returns. Starting from a ray–shape intersection, a surface-scattering kernel handles everything from normal and bump mapping to material evaluation, light and BSDF sampling, and queuing up shadow and indirect rays for later processing.
All of this starts with the Render() method calling EvaluateMaterialsAndBSDFs(), which is implemented in wavefront/surfscatter.cpp. With this method’s implementation, we encounter a new idiom that orchestrates the kernel launches: a call to ForEachType(). In its use here, that function iterates over all of the types that a Material could be and calls a callback function for each one of them. Thus, if pbrt is extended with an additional material, adding that one to the list of materials that Material passes to the TaggedPointer that it inherits from in its declaration in base/material.h automatically leads to a specialized kernel for that material being generated here.
A lambda function is not sufficient to pass to ForEachType(), as it does not pass a value of the given type to the callback but instead invokes its function call operator, passing the type for use in a template specialization. Therefore, we wrap the values needed for the material evaluation method call in a small structure.
ForEachType() invokes the following method for each type of material. We skip over the MixMaterial here, since all instances of it are resolved to one of the other material types before being enqueued for the surface-shading kernels. (See Section 10.5.1 for a discussion of this detail of MixMaterial’s usage.)
The EvaluateMaterialAndBSDF() method does not yet bring us to the point of launching kernels—two considerations are handled beforehand. First, the implementation skips launching the specialized EvaluateMaterialAndBSDF() method for any material types that are not present in the scene. Such work queues will have no entries, so there is no reason to bother with them. This is handled using two arrays, haveBasicEvalMaterial and haveUniversalEvalMaterial, that are initialized in the WavefrontPathIntegrator constructor. (The two further distinguish the materials based on the complexity of their textures, which is a topic that will be discussed immediately after the following fragment.) Both arrays are indexed using the TaggedPointer TypeIndex() method, which returns an integer index for each representable type.
These give two more queues to add to the ones that are reset at the start of tracing rays at each wavefront depth.
Before continuing into the EvaluateMaterialAndBSDF() template specialization, we will detour to discuss texture evaluation in the wavefront integrator in more detail. Recall that when the Material interface was introduced in Section 10.5, it included the notion of a TextureEvaluator. Methods like GetBxDF() and GetBSSRDF() were templated on this type, took an instance of it as a parameter, and used it to evaluate textures rather than calling their Evaluate() methods directly.
There was no point in doing that for CPU rendering: there, a UniversalTextureEvaluator is always used. It immediately forwards texture evaluation requests on to the textures. pbrt’s full set of textures spans a wide range of complexity, however, ranging from trivial constant textures that return a fixed value to complex textures that evaluate noise functions to ones like SpectrumMixTexture that themselves recursively evaluate other textures.
Not only do more complex textures require more registers on the GPU, but the potential for unbounded recursion from the mixture textures requires that the compiler allocate resources to be prepared for that case. In turn, the performance of evaluating the simpler textures can be harmed due to choices the compiler has made for the more complex ones. Because the simpler textures are common, it is thus worthwhile to separate materials according to the complexity of their textures and to have separate kernels for the materials that only use the simpler ones. Doing so can give further benefits from reducing control flow divergence. The following BasicTextureEvaluator class helps with this task.
We will categorize constant textures and plain image map textures as “basic.” Thus, the BasicTextureEvaluator’s CanEvaluate() method iterates over all provided textures and checks that each is one of those types. When material evaluation work is to be enqueued in the EnqueueWorkAfterIntersection() function, the material’s Material::CanEvaluateTextures() method is called with a BasicTextureEvaluator to determine whether the work is valid for the basic material evaluation queues. If not, it goes on the appropriate universalEvalMaterialQueue.
The texture types are easily checked with the TaggedPointer::Is() method. (The scene initialization code creates instances of GPUFloatImageTexture in place of FloatImageTextures when GPU rendering is being used. That class uses platform-specific functionality to perform filtered texture look-ups more efficiently than executing Image methods would. GPUSpectrumImageTexture follows equivalently.)
The corresponding fragment for SpectrumTextures is equivalent and is therefore not included here.
The implementation of the TextureEvaluator evaluation method is key to the efficiency benefits provided by this approach. It would do no good to sort materials based on their textures but to then use pbrt’s regular dynamic dispatch mechanism for texture evaluation: in that case, the compiler would have no insight into the fact that the texture being passed to it must be one of the simple types accepted by CanEvaluate().
Therefore, the evaluation method instead tries casting the texture to each of the supported types until it finds the correct one. It then calls the corresponding evaluation method directly. In this way, the compiler can easily tell that the only texture evaluation methods that might be called are those for FloatConstantTexture, FloatImageTexture, and GPUFloatImageTexture, and it does not need to worry about resource allocation for evaluating the more complex texture types.
The evaluation method for spectrum textures is similar and therefore not included here.
With the motivation for the BasicTextureEvaluator explained, it is possible to better understand why ImageTextureBase in Section 10.4.1 offers scale and invert parameters even though scale and mix textures could be used to achieve the same results. With that capability directly available in textures that can be evaluated by the BasicTextureEvaluator, the UniversalTextureEvaluator can be invoked less often. pbrt actually has a texture rewriting pass that, for example, converts a scale texture with a constant scale applied to an image texture to just an image texture, configured to apply that scale itself. (See, for example, the SpectrumScaledTexture::Create() method implementation for details.)
The following definition of the MaterialEvalQueue type is admittedly complex. However, it is another key to pbrt’s extensibility. For material and BSDF evaluation, we would like to have a MultiWorkQueue where there is a separate queue for each type of Material that pbrt supports, where the type of the work items is the template class MaterialEvalWorkItem, specialized with each material type. While we could enumerate all the currently supported materials in template parameters, to do so would mean that adding a new material to the system would require editing an obscure part of the wavefront integrator implementation for it to be available there as well.
Therefore, we go through some gymnastics in order to define the type of the MultiWorkQueue automatically. First, the TaggedPointer type that Material inherits from includes a type declaration, Types, that holds a TypePack of all types that the pointer can represent. We then use the MapType functionality from Section B.4.3 to wrap each material type inside the forthcoming MaterialEvalWorkItem template class. This gives us the final TypePack of types to provide to MultiWorkQueue.
It is crucial to have a pointer to the material in MaterialEvalWorkItem. Note that this can be a pointer to a concrete material type such as DiffuseMaterial due to MaterialEvalWorkItem being a template class on the ConcreteMaterial type. (We will introduce the rest of the member variables of the MaterialEvalWorkItem as they are used in code in the remainder of the section.)
With this context in hand, we can proceed to the implementation of the EvaluateMaterialAndBSDF() method. It is parameterized both by the concrete type of material that is being evaluated and by a TextureEvaluator, which allows us to generate specializations based not only on material but also on the two types of TextureEvaluator.
After some initialization work that includes getting a pointer to the WorkQueue for material evaluation of ConcreteMaterial, the work items in the queue are processed in parallel.
The structure of the material evaluation and sampling kernel parallels that of the surface-scattering–focused part of VolPathIntegrator::Li().
One difference from the VolPathIntegrator is that the wavefront integrator does not carry ray differentials with the rays it traces. Therefore, in order to compute filter widths for texture filtering, the Camera’s Approximate_dp_dxy() method is used to compute approximate differentials at the intersection point. While the resulting filter width estimates are less accurate if there has been specular reflection or transmission from curved surfaces, they usually work well in practice. Note that by declaring local variables dpdu and dpdv at the end of this fragment that store the surface’s partial derivatives, we are able to reuse the earlier fragment <<Estimate screen-space change in >> to compute the texture derivatives.
Before any further work is done, the effect of normal or bump mapping on the local shading geometry is found, if appropriate.
MaterialEvalWorkItem carries along information about the initial shading geometry from the intersected shape for a starting point for calls to NormalMap() or BumpMap() and for direct use if there is no normal or bump mapping at the intersection point.
As before, all work related to normal mapping is performed in the NormalMap() function and bump mapping is handled by BumpMap(). Note that here we are able to avoid dynamic dispatch in the calling of the material’s GetDisplacement() and GetNormalMap() methods thanks to ConcreteMaterial being a known specific material type. If a displacement texture or normal map is present, the NormalBumpEvalContext for the call to BumpMap() comes from a MaterialEvalWorkItem utility function that we do not include here; it just initializes a NormalBumpEvalContext using appropriate values from its member variables.
On the GPU, there are a few ways that there may be execution divergence here. First, if some of the threads in the thread group have normal maps or displacement textures and some do not, then all of them will pay the cost of executing the NormalMap() and BumpMap() functions. Second, some threads may have displacement textures and others may have normal maps, which will lead to execution divergence. Finally, the types of the FloatTextures used for defining displacements may vary across the threads. Each of these factors could be used to sort work more finely, though we have not found this divergence to be too much of a problem in practice.
The fragment that calls BumpMap() for bump mapping parallels the one for normal mapping, so it is not included here.
The BSDF is initialized here in a different way than in the VolPathIntegrator. Because the EvaluateMaterialAndBSDF() method is specialized on the material type, and because Materials in pbrt must provide a type definition for their associated BxDF, it is possible to stack-allocate the BxDF here rather than use an instance of the ScratchBuffer class to allocate space for it. It is then possible to call the material’s GetBxDF() method directly rather than using dynamic dispatch through a call to Material::GetBSDF() to do so.
The benefit from avoiding use of the ScratchBuffer for BxDFs is significant: just as stack-allocating the concrete Sampler type in the camera ray generation and sample generation kernels made it possible for the Sampler’s member variables to be stored in GPU registers, this approach does the same for the BxDF, giving a substantial performance benefit compared to storing them in device memory.
The MaterialEvalWorkItem GetMaterialEvalContext() method, which is also not included here, initializes a MaterialEvalContext from its corresponding member variables.
The call to GetBxDF() above may cause secondary wavelengths to be terminated—for example, in case of dispersion. It is therefore important that the stack-allocated lambda value both be used here in this kernel and also be passed along to subsequent rendering kernels, rather than the initial SampledWavelengths value from the MaterialEvalWorkItem. If the secondary wavelengths in lambda have been terminated, the copy of the SampledWavelengths for the path in PixelSampleState must be updated. Note that the implementation here may end up writing lambda multiple times redundantly at subsequent intersections in that case.
BSDF regularization proceeds as before, only happening if the option has been enabled and a ray has undergone a non-specular scattering event.
The anyNonSpecularBounces member variable is yet another single-bit quantity taking up 32 bits. A more bandwidth-efficient implementation would pack this value into a free bit elsewhere in the MaterialEvalWorkItem.
We will omit the fragment that initializes the VisibleSurface at the first intersection, as it is not especially interesting or different than the corresponding code in the CPU-based rendering path.
There are two new things to see in the fragment that samples the BSDF. First, the sample values used come from the RaySamples object rather than from Sampler method calls. Second, because the ConcreteBxDF type is known here at compile time, it is possible to call the templated BSDF::Sample_f() variant that takes the BxDF type and thus avoids dynamic dispatch, calling the appropriate BxDF method implementation directly.
The path throughput and rescaled path probabilities are updated in the same manner as in the VolPathIntegrator <<Update volumetric integrator path state after surface scattering>> fragment.
Only the unidirectional rescaled path probability needs to be passed in to this kernel, since the light path probability is initialized during its execution.
The logic for updating the unidirectional rescaled path probability also follows that of the VolPathIntegrator, including how proportional BSDF samples such as those from the LayeredBxDF are handled. The call to the BSDF::PDF() method presents another opportunity to avoid dynamic dispatch given a compile-time known BxDF, however.
The etaScale factor used in Russian roulette is also updated in the same manner as before.
We will omit the <<Apply Russian roulette to indirect ray based on weighted path throughput>> fragment, which is almost exactly the same as the VolPathIntegrator’s <<Possibly terminate volumetric path with Russian roulette>>, other than the fact that in this case the sample value for the Russian roulette test comes from RaySamples.indirect.rr.
For indirect rays, a ray is initialized and pushed on to the RayQueue for the next level of the ray tree. (We omit the implementation of the RayQueue::PushIndirectRay() method, which stores the provided values in the corresponding member variables.)
The medium member of the Ray must be set manually based on which side of the surface the ray starts in.
The MediumInterface at the intersection point is thus needed in MaterialEvalWorkItem.
Indirect rays require a few additions to the RayWorkItem class including the current path throughput , rescaled path probabilities, and the information necessary to compute MIS weights if the ray encounters emission.
The direct lighting calculation follows a similar path as in the VolPathIntegrator.
We will omit the <<Choose a light source using the LightSampler>> and <<Sample light source and evaluate BSDF for direct lighting>> fragments, as they are essentially the same as corresponding fragments that we have seen in earlier integrators. The first fragment gives us the sampledLight variable that stores a SampledLight and the second calls Light::SampleLi() with that light, giving an ls variable that stores the LightLiSample.
The path throughput and path sampling weights are computed in the same way as they are in the VolPathIntegrator::SampleLd() method, though here it is again possible to use the templated BSDF::PDF() method that avoids the dynamic dispatch overhead to compute the BSDF’s PDF.
We go ahead and compute Ld, which is the final weighted contribution that the shadow ray would give to the pixel sample’s radiance estimate if it is unoccluded. If the ray is unoccluded and there are participating media in the scene, Ld may still be reduced due to the effect of extinction along the shadow ray; that factor is handled when the shadow ray is traced.
15.3.10 Shadow Rays
The set of shadow rays to be traced after material evaluation and the direct lighting calculation in the EvaluateMaterialAndBSDF() method are stored in a ShadowRayQueue.
The work items in the queue store values that include the ray to be traced, its wavelengths, and tentative contribution Ld, as well as the rescaled path probabilities up to the ray’s origin.
The WavefrontAggregate interface includes two methods for tracing shadow rays: IntersectShadow(), which is called for scenes that have no participating media where only a binary visibility test is needed, and IntersectShadowTr() for scenes that do have media and where transmittance must be computed. A call to the WavefrontPathIntegrator::TraceShadowRays() method leads to a call of the appropriate method; the shadow ray queue is reset immediately afterward.
The CPU implementation of IntersectShadow() processes items from the queue in parallel and calls its aggregate’s IntersectP() method to determine if each ray is occluded. A call to RecordShadowRayResult(), another utility function shared by both CPU and GPU aggregates that is defined in wavefront/intersect.h, takes care of the additional work to be done after the intersection test has been resolved.
If the ray was occluded, then no further work is necessary. Otherwise, the final MIS-weighted radiance is found by dividing by the two rescaled path probabilities and is then added to the running sum of the radiance estimate in the PixelSampleState for the ray.
In the presence of participating media, there is more work to do for shadow rays. The computation proceeds just as in the latter two thirds of the VolPathIntegrator::SampleLd() method: the ray is successively intersected against the scene geometry, terminating if it hits an opaque surface. Otherwise a call to SampleT_maj() samples the medium, so that transmittance can be estimated using ratio tracking just as in <<Update transmittance for current ray segment>> in that method. Given the close similarities, we will not include that code for the wavefront integrator here.
15.3.11 Updating the Film
Each pixel sample’s value is provided to the Film only after all of them are finished. One might wonder: why not push samples on to a queue whenever their ray paths terminate and add film samples sooner by periodically running a kernel that processes the items on the queue? There are two costs to such an approach: the first is that samples would be added to the film in an arbitrary order, so the accesses to values like PixelSampleState::L would be irregular across threads in thread groups, which could harm performance.
The second cost is the unnecessary bandwidth that would be used to push work onto the queue and to process it. We know that every pixel sample taken at the start of tracing each path will eventually be added to the film, so there is no need to separately enumerate them. For both of these reasons, it is more efficient to use a plain GPUParallelFor() loop and process the PixelSampleState structures in order.
Recall that the GenerateCameraRays() method sets the PixelSampleState::pPixel coordinates for all the threads, but that it then exits immediately for threads corresponding to extra scanlines beyond the end of the image. Therefore, we must perform the same check of the pixel coordinates against the film bounds here, and not do any further processing for such pixels.
A few more values read from pixelSampleState and we have everything that we need to call the Film’s AddSample() method. The implementation uses the initializeVisibleSurface member variable, set in the constructor, to distinguish between Film implementations that make use of the VisibleSurface and those that do not in order to save the memory bandwidth of reading it if it will not be used.
One interesting detail is that the VisibleSurface must be loaded from SOA format into the regular VisibleSurface structure layout so that a pointer to it can be passed to the AddSample() method; a pointer to its current in-memory representation does not correspond to the layout that this method expects.