A.6 Parallelism
Section 1.4 introduced some basic principles of parallel programming and described their application to pbrt. Here, we’ll go into more detail about performance issues related to multi-threading as well as describe the implementation of pbrt’s ParallelFor() function, which is used throughout the system for parallel for loops, where different iterations of the loop can execute concurrently in different threads.
A.6.1 Memory Coherence Models and Performance
Cache coherence is a feature of all modern multicore CPUs; with it, memory writes by one processor are automatically visible to other processors. This is an incredibly useful feature; being able to assume it in the implementation of a system like pbrt is extremely helpful to the programmer. Understanding the subtleties and performance characteristics of this feature is important, however.
One potential issue is that other processors may not see writes to memory in the same order that the processor that performed the writes issued them. This can happen for two main reasons: the compiler’s optimizer may have reordered write operations to improve performance, and the CPU hardware may write values to memory in a different order than the stream of executed machine instructions. In the single-threaded case, both of these are innocuous; by design, the compiler and hardware, respectively, ensure that it’s impossible for a single thread of execution running the program to detect when these cases happen. This guarantee is not provided for multi-threaded code, however; doing so would impose a significant performance penalty, so hardware architectures leave handling this problem, when it matters, to software.
Memory barrier instructions can be used to ensure that all write instructions before the barrier are visible in memory before any subsequent instructions execute. In practice, we generally don’t need to issue memory barrier instructions explicitly, since the thread synchronization calls used to build multi-threaded algorithms take care of this; they are defined to make sure that writes are visible so that if we are coordinating execution between multiple threads using these calls, then they have a consistent view of memory after synchronization points.
Although cache coherence is helpful to the programmer, it can sometimes impose a substantial performance penalty for data that is frequently modified and accessed by multiple processors. Read-only data has little penalty; copies of it can be stored in the local caches of all of the processors that are accessing it, allowing all of them the same performance benefits from the caches as in the single-threaded case. To understand the downside of taking too much advantage of cache coherence for read–write data, it’s useful to understand how cache coherence is typically implemented on processors.
CPUs implement a cache coherence protocol, which is responsible for tracking the memory transactions issued by all of the processors in order to provide cache coherence. A classic such protocol is MESI, where the acronym represents the four states that each cache line can be in. Each processor stores the current state for each cache line in its local caches:
- Modified—The current processor has written to the memory location, but the result is only stored in the cache—it’s dirty and hasn’t been written to main memory. No other processor has the location in its cache.
- Exclusive—The current processor is the only one with the data from the corresponding memory location in its cache. The value in the cache matches the value in memory.
- Shared—Multiple processors have the corresponding memory location in their caches, but they have only performed read operations.
- Invalid—The cache line doesn’t hold valid data.
At system startup time, the caches are empty and all cache lines are in the invalid state. The first time a processor reads a memory location, the data for that location is loaded into cache and its cache line marked as being in the “exclusive” state. If another processor performs a memory read of a location that is in the “exclusive” state in another cache, then both caches record the state for the corresponding memory location to instead be “shared.”
When a processor writes to a memory location, the performance of the write depends on the state of the corresponding cache line. If it’s in the “exclusive” state and already in the writing processor’s cache, then the write is cheap; the data is modified in the cache and the cache line’s state is changed to “modified.” (If it was already in the “modified” state, then the write is similarly efficient.) In these cases, the value will eventually be written to main memory, at which point the corresponding cache line returns to the “exclusive” state.
However, if a processor writes to a memory location that’s in the “shared” state in its cache or is in the “modified” or “exclusive” state in another processor’s cache, then expensive communication between the cores is required. All of this is handled transparently by the hardware, though it still has a performance impact. In this case, the writing processor must issue a read for ownership (RFO), which marks the memory location as invalid in the caches of any other processors; RFOs can cause stalls of tens or hundreds of cycles—a substantial penalty for a single memory write.
In general, we’d therefore like to avoid the situation of multiple processors concurrently writing to the same memory location as well as unnecessarily reading memory that another processor is writing to. An important case to be aware of is “false sharing,” where a single cache line holds some read-only data and some data that is frequently modified. In this case, even if only a single processor is writing to the part of the cache line that is modified but many are reading from the read-only part, the overhead of frequent RFO operations will be unnecessarily incurred.
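As a hypothetical illustration (not code from pbrt), consider a structure in which a frequently modified counter sits next to read-only data; aligning the hot member to a typical 64-byte cache line keeps writes to it from invalidating the line that holds the read-only part.

```cpp
#include <atomic>

// False sharing: "scale" is read-only after initialization, but it shares a
// cache line with "hits", which one thread updates frequently, so readers of
// "scale" keep triggering coherence traffic.
struct SharedState {
    float scale;
    std::atomic<int> hits;
};

// Padding the frequently written member out to its own (typical) 64-byte
// cache line avoids the false sharing.
struct PaddedState {
    float scale;
    alignas(64) std::atomic<int> hits;
};
```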
A situation where many processors might be concurrently trying to write to the same or nearby memory locations is when image sample values are accumulated into the final image. To ensure that image updates don’t pay the RFO cost, each rendering thread in the ParallelFor() loop of the SamplerIntegrator creates a private FilmTile to use for accumulating sample values for the part of the image that it’s working on; it is then free to modify the FilmTile pixel values without worrying about contention with other threads for those memory locations. Only when a portion of the image is finished is the tile merged into the main image, thus allowing the overhead of mutual exclusion and RFO operations to be amortized over a smaller number of larger updates.
A.6.2 Atomic Operations
Recall from Section 1.4 that mutexes can be used to ensure that multiple threads don’t simultaneously try to update the same memory locations. However, modern CPUs and GPUs also provide specialized hardware instructions to perform certain operations atomically, generating consistent results when multiple threads use them to modify the same location concurrently. When applicable, atomics are generally more efficient than acquiring a mutex, updating the memory location, and releasing the mutex. Atomic instructions can only operate on a limited amount of memory (up to 8 bytes on current architectures) and support only a few operations (addition, swap, etc.). If atomic updates to more data or other kinds of operations are required, mutexes must generally be used instead.
C++11 provides a variety of atomic operations in the standard library, available via the <atomic> header file. For example, given the declaration of an integer counter as a std::atomic<int> as follows, incrementing counter is an atomic operation.
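A minimal sketch of such a declaration and use (the CountEvent() wrapper here is hypothetical, added only so the increment appears inside a function):

```cpp
#include <atomic>

static std::atomic<int> counter{0};

// Hypothetical helper: the increment is an atomic read-modify-write, so
// concurrent calls from multiple threads will not lose updates.
void CountEvent() {
    ++counter;
}
```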
Atomic instructions do introduce some overhead, so they should only be used in cases where they are actually necessary.
Another useful atomic operation is “compare and swap,” which is also exposed by the C++ standard library. It takes a memory location and the value that the caller believes the location currently stores. If the memory location still holds that value when the atomic compare and swap executes, then a new value is stored and true is returned; otherwise, memory is left unchanged and false is returned.
Compare and swap is a building block that can be used to build many other atomic operations. For example, the code below could be executed by multiple threads to compute the maximum of values computed by all of the threads. (For this particular case, a specialized atomic maximum operation, if available, would be a better choice, but this example helps convey the usage.)
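A sketch of such code follows; the variable names match the discussion below, and maxValue is assumed to be a std::atomic<int> shared by all of the threads.

```cpp
#include <atomic>

// Each thread calls this with the maximum value it computed locally.
void UpdateMax(std::atomic<int> &maxValue, int localMax) {
    int currentMax = maxValue;
    while (localMax > currentMax) {
        // If maxValue still holds currentMax, store localMax and finish.
        // On failure, currentMax is updated to maxValue's current value and
        // the loop condition is evaluated again.
        if (maxValue.compare_exchange_weak(currentMax, localMax))
            break;
    }
}
```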
If only a single thread is trying to update the memory location and the local value is larger, the loop is successful the first time through; the value loaded into currentMax is still the value stored by maxValue when compare_exchange_weak() executes and so localMax is successfully stored and true is returned. If multiple threads are executing concurrently, then another thread may update the value in maxValue between the thread’s read of maxValue and the execution of compare_exchange_weak(). In that case, the compare and swap fails, memory isn’t updated, and another pass is taken through the loop to try again. In the case of a failure, compare_exchange_weak() updates currentMax with the new value of maxValue.
An important application of atomic compare and swap is for the construction of data structures (as is done in Section 16.2.5 for photon mapping). Consider, for example, a tree data structure where each node has child node pointers initially set to nullptr. If code traversing the tree wants to create a new child at a node, code could be written like:
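A sketch in that spirit follows; it is not pbrt’s actual tree code, and the TreeNode and MaybeCreateChild() names are hypothetical.

```cpp
#include <atomic>

struct TreeNode {
    // Child pointer, initially nullptr; updated atomically.
    std::atomic<TreeNode *> firstChild{nullptr};
    // ... other per-node data ...
};

void MaybeCreateChild(TreeNode *node) {
    if (node->firstChild == nullptr) {
        // Speculatively create and fully initialize a new child; it is not
        // yet visible to other threads.
        TreeNode *newChild = new TreeNode;
        TreeNode *expected = nullptr;
        // Atomically install the child only if the pointer is still nullptr.
        if (!node->firstChild.compare_exchange_strong(expected, newChild)) {
            // Another thread installed a child first; discard ours and use
            // the one now stored in node->firstChild.
            delete newChild;
        }
    }
}
```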
The idea is that if the child has the value nullptr, the thread speculatively creates and fully initializes the child node into a local variable, not yet visible to the other threads. Atomic compare and swap is then used to try to initialize the child pointer; if it still has the value nullptr, then the new child is stored and made available to all threads. If the child pointer no longer has the value nullptr, then another thread has initialized the child in the time between the current thread first seeing that it was nullptr and later trying to update it. In this case, the work done in the current thread turns out to have been wasted, but it can delete the locally created child node and continue execution, using the node created by the other thread.
This method of tree construction is a simple example of a lock-free algorithm. This approach has a few advantages compared to, for example, using a reader–writer mutex to manage updating the tree. First, there’s no overhead of acquiring the reader mutex for regular tree traversal. Second, multiple threads can naturally concurrently update different parts of the tree. With a single reader–writer mutex, if one thread acquires the mutex to update one node in the tree, other threads won’t be able to update other nodes. The “Further Reading” section at the end of the appendix has pointers to more information about lock-free algorithms.
A.6.3 Atomic Floating-Point Values
The std::atomic template cannot be used with floating-point types. One of the main reasons that atomic operations are not supported with it is that floating-point operations are generally not associative: as discussed in Section 3.9.1, when computed in floating-point, the value of the sum (a+b)+c is not necessarily equal to the sum a+(b+c). In turn, if a multi-threaded computation used atomic floating-point addition operations to compute some value, then the result computed wouldn’t be the same across multiple program executions. (In contrast, with integer types, all of the supported operations are associative, and so atomic operations give consistent results no matter which order threads perform them in.)
For pbrt’s needs, these inconsistencies are generally tolerable, and being able to use atomic operations on Floats is preferable in some cases to using a lock. (One example is splatting pixel contributions in the Film::AddSplat() method.) For these purposes, we provide a small AtomicFloat class.
An AtomicFloat can be initialized from a provided floating-point value. In the implementation here, floating-point values are actually represented as their unsigned integer bitwise values, as returned by the FloatToBits() function.
By using a uint32_t to represent the value, we can use a std::atomic type to store it in memory, which in turn allows the compiler to be aware that the value in memory is being updated atomically. (If pbrt has been compiled to use 64-bit doubles for Float values, a uint64_t is used instead, though this code isn’t included here.)
Assigning the value or returning it as a Float is just a matter of converting to or from the unsigned integer representation.
Atomic floating-point addition is implemented via an atomic compare and exchange operation. In the do loop below, we convert the in-memory bit representation of the value to a Float, add the provided difference in v, and attempt to atomically store the resulting bits. If the in-memory value has been changed by another thread since the value from bits was read from memory, the implementation continues retrying until the value in memory matches the expected value (in oldBits), at which point the atomic update succeeds.
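Putting the pieces just described together, a consolidated sketch of the class looks like the following; it assumes pbrt’s Float typedef (32-bit here) and its FloatToBits()/BitsToFloat() conversion helpers, and the actual pbrt code is split across several fragments.

```cpp
#include <atomic>
#include <cstdint>

class AtomicFloat {
  public:
    explicit AtomicFloat(Float v = 0) { bits = FloatToBits(v); }
    operator Float() const { return BitsToFloat(bits); }
    Float operator=(Float v) {
        bits = FloatToBits(v);
        return v;
    }
    void Add(Float v) {
        uint32_t oldBits = bits, newBits;
        do {
            // Compute the bits of the updated value; if another thread has
            // changed the stored bits in the meantime, the compare and
            // exchange fails, oldBits is refreshed, and we retry.
            newBits = FloatToBits(BitsToFloat(oldBits) + v);
        } while (!bits.compare_exchange_weak(oldBits, newBits));
    }

  private:
    // Storing the float's bits in an atomic integer lets the compiler know
    // that the value in memory is updated atomically.
    std::atomic<uint32_t> bits;
};
```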
pbrt doesn’t currently need to perform any other operations on AtomicFloats, so we don’t provide any additional methods.
A.6.4 Parallel For Loops
All of the multi-core parallelism in pbrt is expressed through parallel for loops using the ParallelFor() function, which is implemented in the files core/parallel.h and core/parallel.cpp. ParallelFor() takes the loop body in the form of a function that is called for each loop iteration as well as a count of the total number of loop iterations to execute. It generally runs multiple iterations in parallel on different CPU cores and it returns only after all of the loop iterations have finished. In using ParallelFor(), the caller makes the implicit promise that it’s safe to execute multiple loop iterations concurrently. An important implication of this promise is that the order in which the loop iterations are executed must not affect the final results computed.
Here is a simple example of using ParallelFor(). A C++ lambda expression is used to define the loop body; the loop index is passed to it as an argument. The lambda has access to the local array variable and doubles each array element in its body. Note that the value 1024 is passed as the second parameter to ParallelFor() after the lambda, giving the number of times to execute the loop body.
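A sketch of such a call follows, assuming the 1D variant of ParallelFor() whose loop index is a 64-bit integer.

```cpp
Float array[1024];
// ... initialize array ...

// Each iteration doubles one element; iterations are independent, so they
// can safely run concurrently.
ParallelFor(
    [&array](int64_t index) { array[index] *= 2; },
    1024);
```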
While it’s also possible to pass a function pointer to ParallelFor(), lambdas are generally much more convenient given their ability to capture locally visible variables and make them available in their body.
For loops with relatively large iteration counts where the work done per iteration is small, it can be worthwhile to have the threads running loop iterations do multiple iterations before getting more work. (Doing so helps amortize the overhead of determining which iterations should be assigned to a thread.) Therefore, ParallelFor() also takes an optional chunkSize parameter that controls the granularity of the mapping of loop iterations to processing threads.
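For example, reusing the array from the earlier sketch and assuming chunkSize is an optional trailing parameter, the following asks each worker to claim 64 iterations at a time.

```cpp
// Grouping iterations into chunks of 64 amortizes the per-chunk bookkeeping
// over more work when each iteration is cheap.
ParallelFor(
    [&array](int64_t index) { array[index] += 1; },
    1024, 64);
```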
ParallelFor() usually distributes loop iterations across multiple threads. However, if the system has only one CPU (or the user specified that only one thread should be used for rendering), or if the number of loop iterations is small, then the loop just runs immediately in the current thread, without any parallelism.
Parallel execution is implemented using a set of worker threads (a thread pool) that is created the first time ParallelFor() is called. The threads don’t terminate after ParallelFor() returns, however; instead they wait on a condition variable that signals more work. This approach means that using the threads for parallel work is a fairly lightweight operation—the overhead of numerous operating system calls to create the threads is only paid once. (This implementation approach is often called persistent threads.) It’s thus possible to use the thread pool for fairly fine-grained tasks, which in turn lets the system load-balance well when tasks have variable amounts of computation and lets the system scale well as more cores are available in the future.
pbrt’s initial execution thread also helps run loop iterations, so the number of worker threads launched is one fewer than the number of available CPU cores; counting the main thread, there is thus a one-to-one relationship between cores and threads. Assuming no other compute-intensive processes are running on the system, pbrt’s threads are collectively enough to fully occupy the CPUs without introducing unnecessary thread-switching overhead from having more threads running than there are available cores. (NumSystemCores() returns the number of processing cores in the system.)
The function that worker threads run, workerThreadFunc(), will be introduced after we show how the state of enqueued parallel for loops is represented.
In the following, threads will need to determine which of the running threads they are. The ThreadIndex variable is declared with a qualifier that indicates that thread-local storage should be allocated for it, so that there is a separate instance of it for each thread. This variable is initialized to 0 for the main thread and goes from 1 to the number of threads for the worker threads.
The workList variable holds a pointer to the head of a list of parallel for loops that aren’t yet finished. Usually, there will be no more than one loop in this list, except in the presence of nested parallelism, when the body of one parallel for loop iteration specifies another parallel for loop in its body. The workListMutex must always be held when accessing workList or values stored in the ParallelForLoop objects held in it.
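Sketches of these declarations follow; the exact qualifiers and types are assumptions in the spirit of the text, and the condition variable used to wake sleeping workers (discussed below) is declared alongside them.

```cpp
#include <condition_variable>
#include <mutex>

class ParallelForLoop;  // defined shortly

// Per-thread index; thread_local gives each thread its own instance. It is
// 0 for the main thread and runs from 1 up to the number of threads for the
// worker threads.
thread_local int ThreadIndex;

// Head of the list of not-yet-finished parallel for loops, the mutex that
// must be held whenever workList or the loops it holds are accessed, and
// the condition variable that signals the availability of more work.
static ParallelForLoop *workList = nullptr;
static std::mutex workListMutex;
static std::condition_variable workListCondition;
```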
Adding a new loop to the work queue is fairly straightforward. After initializing the ParallelForLoop object that represents the loop’s work, the implementation here locks the mutex and adds the loop to the head of the list. There are two important details here: first, because the call to ParallelFor() here doesn’t return until all work for the loop is done, it’s safe to allocate loop on the stack—no dynamic memory allocation is required.
Second, the loop is added to the front of the work list—doing so means that in the presence of nested parallelism, the inner loops will run before their enclosing loops. This leads to depth-first processing of the nested loops (rather than breadth-first), which in turn can avoid an explosion in the number of loops in the work list.
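A sketch of this step, following the description above:

```cpp
// Add the loop to the head of the work list while holding the mutex. The
// lock is kept held, since the steps that follow (signaling the workers and
// helping run loop iterations) also access the shared state.
std::unique_lock<std::mutex> lock(workListMutex);
loop.next = workList;
workList = &loop;
```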
The ParallelForLoop class encapsulates the relevant information about a parallel for loop body, including the function to run, the number of iterations, and which iterations are already done.
ParallelForLoop can represent loops over both 1D and 2D domains corresponding to the variants of the ParallelFor() function. In the following, we’ll only show the code for the 1D case.
The nextIndex member variable tracks the next loop index to be executed. It is incremented by workers as they claim loop iterations to execute in their threads. The value stored in activeWorkers records how many worker threads are currently running iterations of the loop. next is used to maintain the linked list of nested loops.
A parallel for loop is only finished when the index has been advanced to the end of the loop’s range and there are no threads currently working on it. Note that the first of these conditions will be reached while work is still in progress.
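A sketch of that test, using the member variables just described (maxIndex is an assumed name for the total iteration count):

```cpp
// The loop is only finished when all indices have been claimed and no
// worker is still running previously claimed iterations.
bool Finished() const {
    return nextIndex >= maxIndex && activeWorkers == 0;
}
```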
After the loop has been added to the work list, the worker threads are signaled so that they wake up and start taking work from the list.
Finally, the thread that called ParallelFor() (be it the main thread or one of the worker threads) starts work on the loop. In the presence of nested parallelism, this means that the thread that enqueued this loop works on it exclusively before returning. By finishing the loop before allowing the thread that submitted it to do any more work, the implementation keeps the amount of enqueued work limited and allows subsequent code in the caller to proceed, knowing the loop’s work is done after its call to ParallelFor() returns.
A lock on workListMutex is always held going into the while loop here. Note that the lock is necessary even for calling the Finished() method, since loop is stored in workList and thus will be accessed by the other threads.
Each time through the while loop, the thread runs one or more iterations of the parallel loop’s body.
The range of iterations to run extends from the current loop index up to chunkSize iterations ahead, limited by the total number of loop iterations.
Now that the thread has found the iterations that it will run, loop must be updated. If this thread took the final iterations, the loop is removed from the work list so that other threads can start on the next loop (if any).
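A combined sketch of these two steps follows; member names such as chunkSize and maxIndex are assumptions matching the discussion.

```cpp
// Claim the next chunk of iterations, [indexStart, indexEnd), clamped to
// the loop's total iteration count.
int64_t indexStart = loop.nextIndex;
int64_t indexEnd = std::min<int64_t>(indexStart + loop.chunkSize, loop.maxIndex);

// Update the loop's shared state: advance the next index, remove the loop
// from the work list if this thread claimed the final iterations, and note
// that one more worker is running iterations of this loop.
loop.nextIndex = indexEnd;
if (loop.nextIndex == loop.maxIndex)
    workList = loop.next;
loop.activeWorkers++;
```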
Given the range of loop iterations to run, it’s fairly straightforward to call back to the std::function representing the loop body. This is the only time in the enclosing while loop that the lock is relinquished, though the time spent running these loop iterations is generally the majority of the time spent in the while loop, so other worker threads generally don’t need to wait long for the lock. The <<Handle other types of loops>> fragment, not included here, handles the 2D loop supported by ParallelForLoop.
After running the set of loop iterations and re-acquiring the lock, the active worker count is updated to reflect that (for now at least) the current thread is no longer working on loop.
While the thread that called ParallelFor() is working on the loop, the other threads also run iterations. workerThreadFunc() is the function that runs to do this in each task execution thread. Its structure is similar to the fragment <<Help out with parallel loop iterations in the current thread>>, with three main differences. First, it runs loops from whatever ParallelForLoops are in workList, not just from a single parallel for loop. Second, it has the thread sleep whenever there isn’t any work to be done. Finally, it continues waiting for more loops to run until the shutdownThreads variable is set, which only happens at the end of program execution.
As before, a lock on the workListMutex must be held at entry to the while loop here.
If there is no available work, the worker thread waits on the workListCondition condition variable. The semantics of condition variables are such that doing so releases the lock, but when this thread is later woken up by the condition variable being signaled, it will again hold the lock.
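A sketch of that wait, assuming lock is a std::unique_lock<std::mutex> holding workListMutex:

```cpp
// Sleep until signaled; workListMutex is released while waiting and held
// again when wait() returns.
workListCondition.wait(lock);
```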
Otherwise, a range of loop iterations to run is taken from the head of workList. The code to run iterations is reused from the <<Run a chunk of loop iterations for loop>> fragment defined earlier.
Finally, as will be discussed shortly in Appendix A.7.1, the worker thread must call ReportThreadStats() before exiting so that its per-thread statistics are merged into the aggregate statistics.
A variant of ParallelFor() takes a Point2i to describe a 2D iteration domain from (0, 0) up to the given point. This version is used to loop over image buckets in Section 1.3.4, for example.
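A sketch of using this variant follows; the name ParallelFor2D and the exact parameter order are assumptions here.

```cpp
// Run the lambda once for each 2D coordinate in [0, nTiles.x) x [0, nTiles.y).
Point2i nTiles(16, 16);
ParallelFor2D(
    [&](Point2i tile) {
        // ... process the image tile with coordinates `tile` ...
    },
    nTiles);
```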
The ThreadIndex variable allows code in parallel for loops to use preallocated temporary buffers or objects like MemoryArenas, giving a separate instance to each worker thread without needing to worry about data races. (See, for example, the <<Generate SPPM visible points>> fragment.) For this use, it’s also useful for calling code to be able to find out the maximum possible thread index.
TerminateWorkerThreads(), not included here, cleans up the resources allocated for the threads.