15 Wavefront Rendering on GPUs

One of the major changes in pbrt for this edition of the book is the addition of support for rendering on GPUs as well as on CPUs. Between the substantial computational capabilities that GPUs offer and the recent availability of custom hardware units for efficient ray intersection calculations, the GPU is a compelling target for ray tracing. For example, the image in Figure 15.1 takes 318.6 seconds to render with pbrt on a 2020-era high-end GPU at 1500 × 1500 resolution with 2048 samples per pixel. On an 8-core CPU, it takes 11,983 seconds to render with the same settings—over 37 times longer. Even on a high-end 32-core CPU, it takes 2,669 seconds to render (still over 8 times longer).

Figure 15.1: Scene Used for CPU versus GPU Ray Tracing Performance Comparison. (Scene courtesy of Angelo Ferretti.)

pbrt’s GPU rendering path offers only a single integration algorithm: volumetric path tracing, following the algorithms used in the CPU-based VolPathIntegrator described in Section 14.2.3. It otherwise supports all of pbrt’s functionality, using the same classes and functions that have been presented in the preceding 14 chapters. This chapter will therefore not introduce any new rendering algorithms but instead will focus on topics like parallelism and data layout in memory that are necessary to achieve good performance on GPUs.

The integrator described in this chapter, WavefrontPathIntegrator, is structured using a wavefront architecture—effectively, many rays are processed simultaneously, with rendering work organized in queues that collect related tasks to be processed together. (“Wavefront” in this context will be defined more precisely in Section 15.1.2.)

Some of the code discussed in this chapter makes more extensive use of advanced C++ features than we have generally used in previous chapters. While we have tried not to use such features unnecessarily, we will see that in some cases they make it possible to generate highly specialized code that runs much more efficiently than would otherwise be possible. We had previously sidestepped many low-level optimizations because of their comparatively small impact on CPUs; such implementation-level decisions can, however, change rendering performance by orders of magnitude when targeting GPUs.

The WavefrontPathIntegrator imposes three requirements on a GPU platform:

  1. It must support a unified address space, where the CPU and GPU can both access the GPU’s memory, using pointers that are consistent on both types of processor. This capability is integral to being able to parse the scene description and initialize the scene representation on the CPU, including initializing pointer-based data structures there, before the same data structures are then used in code that runs on the GPU.
  2. The GPU compilation infrastructure must be compatible with C++17, the language that the rest of pbrt is implemented in. This makes it possible to use the same class and function implementations on both types of processors.
  3. The GPU must have support for ray tracing, either in hardware or in vendor-supplied software. (pbrt’s existing acceleration structures would not be efficient on the GPU in their present form.)

The attentive reader will note that CPUs themselves fulfill all of those requirements, the third potentially via pbrt’s acceleration structures from Chapter 7. Therefore, pbrt makes it possible to execute the WavefrontPathIntegrator on CPUs as well; it is used if the --wavefront command-line option is provided. However, the wavefront organization is usually not a good fit for CPUs and performance is almost always worse than if the VolPathIntegrator is used instead. Nonetheless, the CPU wavefront path is useful for debugging and testing the WavefrontPathIntegrator implementation on systems that do not have suitable GPUs.

At this writing, the only GPUs that provide all three of these capabilities are based on NVIDIA’s CUDA platform, so NVIDIA’s GPUs are the only ones that pbrt currently supports. We hope that it will be possible to support others in the future. Around two thousand lines of platform-specific code are required to handle low-level details like allocating unified memory, launching work on the GPU, and performing ray intersections on the GPU. As usual, we will not include platform-specific code in the book, but see the gpu/ directory in the pbrt source code distribution for its implementation.