Since OpenCL 2.0, the OpenCL C device programming language includes a set of work-group parallel reduction and scan built-in functions. These functions allow developers to execute local reductions and scans for the most common operations (addition, minimum and maximum), and allow vendors to implement them very efficiently using hardware intrinsics that are not normally exposed in OpenCL C.
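
As a minimal illustration, a kernel computing per-work-group partial sums with one of these built-ins might look like the following sketch (the kernel and buffer names are made up for the example; work_group_reduce_add itself is the OpenCL C 2.0 built-in):

    // Each work-group sums its own chunk of the input buffer.
    // work_group_reduce_add must be reached by all work-items in the
    // work-group, and returns the reduction result to each of them.
    kernel void partial_sums(global const float *in, global float *out)
    {
        const float x = in[get_global_id(0)];
        const float group_sum = work_group_reduce_add(x);

        // one work-item per group writes out the partial result
        if (get_local_id(0) == 0)
            out[get_group_id(0)] = group_sum;
    }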

In this article I aim to challenge the choice of exposing such high-level functions while hiding the lower-level intrinsics on which their efficient implementation might rely: I will argue that this choice results in lower flexibility and less efficient OpenCL programs, and is ultimately detrimental to the quality of the standard itself.

While my arguments will focus specifically on the parallel reductions and scans offered by OpenCL C since OpenCL 2.0, the fundamental idea applies in a much more general context: it is more important for a language or library to provide the building blocks from which high-level features can be built than to expose the high-level features themselves while hiding the underlying building blocks.

For example, the same kind of argument would apply to a language or library that aimed at providing support for Interval Analysis (IA). A fundamental computational aspect required for proper IA support is directed rounding: just exposing directed rounding would be enough to allow efficient (custom) implementations of IA, and would also allow other numerical feats (as discussed here); conversely, while it is possible to provide support for IA without exposing the underlying directed rounding features, doing so results in an inefficient, inflexible standard¹.

The case against high-level reduction operations

To clarify, I'm not actually against the presence of high-level reduction and scan functions in OpenCL. They are definitely a very practical and useful set of functions, with the potential for very efficient vendor implementations: in fact, more efficient than any programmer could achieve, not just because they can be tuned (by the vendor) for the specific hardware, but also because they can be implemented using hardware capabilities that are not exposed in the standard nor via extensions.

The problem is that the set of available functions is very limited (and must be so), and as soon as a developer needs a reduction or scan that is even slightly different from the ones offered by the language, it becomes impossible to implement it with the same efficiency as the built-in ones, simply because the underlying hardware capabilities (necessary for the optimal implementation) are not available to the developer.

Thrust and Kahan summation

Interestingly enough, I've hit a similar issue while working on a different code base, which uses CUDA rather than OpenCL, and for which we rely on the thrust library for the most common reduction operations.

The thrust library is a C++ template library that provides efficient CUDA implementations of a variety of common parallel programming paradigms, and is flexible enough to allow such paradigms to make use of user-defined operators, allowing for example reductions and scans with operators other than summation, minimum and maximum. Despite this flexibility, however, even the thrust library cannot move (easily) beyond stateless reduction operators, so that, for example, one cannot trivially implement a parallel reduction with Kahan summation using only the high-level features offered by thrust.
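
To make the issue concrete, this is the classic (serial) Kahan summation loop in plain C; the running compensation term is exactly the kind of per-accumulator state that a plain, stateless binary operator has nowhere to keep (unless the accumulator type itself is widened to a sum/compensation pair, with all the plumbing that entails):

    #include <stddef.h>

    /* Serial Kahan (compensated) summation: c accumulates the low-order
     * bits lost by each addition and must travel along with the running
     * sum, which is what makes the operator effectively stateful. */
    float kahan_sum(const float *data, size_t n)
    {
        float sum = 0.0f;
        float c = 0.0f; /* running compensation */
        for (size_t i = 0; i < n; ++i) {
            float y = data[i] - c;
            float t = sum + y;
            c = (t - sum) - y; /* what was lost adding y to sum */
            sum = t;
        }
        return sum;
    }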

Of course, this is not a problem per se, since ultimately thrust just compiles to plain CUDA code, and it is possible to write such code by hand, thus achieving a Kahan summation parallel reduction, as efficiently as the developer's prowess allows. (And since CUDA exposes most if not all hardware intrinsics, such a hand-made implementation can in fact be as efficient as possible on any given CUDA-capable hardware.)

Local parallel reductions in OpenCL 2.0

The situation in OpenCL is sadly much worse, and not so much due to the lack of a high-level library such as thrust (to which end one may consider the Bolt library instead), but because the language itself is missing the fundamental building blocks needed to produce the most efficient reductions: while it does offer built-ins for the most common operations, anything beyond that must be implemented by hand, and cannot be implemented as efficiently as the hardware allows.

One could be led to think that (at least for something like my specific use case) it would be “sufficient” to provide more built-ins for a wider range of reduction operations, but such an approach would completely miss the point: there will always be variations of reductions that the language does not provide, and those variations will always be inefficient.

Implementor laziness

There is also another point to consider, and it has to do with the sad state of the OpenCL ecosystem. Developers that want to use OpenCL for their software, be it in academia, gaming, medicine or any industry, must face the reality of the quality of existing OpenCL implementations. And while for custom solutions one can focus on a specific vendor, and in fact choose the one with the best implementations, software vendors have to deal with the idiosyncrasies of all OpenCL implementations, and the best they can expect is for their customers to be up to date with the latest drivers.

What this implies in this context is that developers cannot, in fact, rely on high-level functions being implemented efficiently, nor can they sit idle waiting for vendors to provide more efficient implementations: more often than not, developers will find themselves working around the limitations of this or that implementation, rewriting code that should reduce to one-liners in order to provide custom, faster implementations.

This is already the case for some functions such as the asynchronous work-group memory copies (from/to global/local memory), which are dramatically inefficient on some vendor implementations. As a result, developers are more likely to write their own loading functions instead, which generally end up being just as efficient as the built-ins on the platforms where such built-ins are properly implemented, and much faster on the lazy platforms.
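
For reference, a hand-rolled replacement is little more than a cooperative strided loop like the following sketch (tile size and names are made up for the example), which on the lazy platforms tends to outperform the async_work_group_copy/wait_group_events pair:

    // Cooperative copy of a tile from global to local memory: each
    // work-item loads a strided subset, then the work-group synchronizes.
    kernel void process(global const float *in, uint tile_size)
    {
        local float tile[256]; // assumes tile_size <= 256 for this sketch
        const size_t base = get_group_id(0) * tile_size;

        for (uint i = get_local_id(0); i < tile_size; i += get_local_size(0))
            tile[i] = in[base + i];

        barrier(CLK_LOCAL_MEM_FENCE);

        // ... work on the tile in local memory ...
    }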

Therefore, can we actually expect vendors to implement the work-group reduction and scan operations as efficiently as their hardware allows? I doubt it. However, while for the memory copies an efficient workaround was offered by simple loads, no such workaround is possible in OpenCL 2.0, since the building blocks of efficient work-group reductions are missing.

Warp shuffles: the work-group reduction building block

Before version 2.0 of the standard, OpenCL offered only one way for work-items within a work-group to exchange information: local memory. The feature reflected the capability of GPUs when the standard was first proposed, and could be trivially emulated on other hardware by making use of global memory (generally resulting in a performance hit).

With version 2.0, OpenCL exposes a new set of functions that allow data exchange between work-items in a work-group without (necessarily) depending on local memory: the work-group vote functions, and the work-group reduction and scan functions. These functions can be implemented via local memory, but most modern hardware can implement them using lower-level intrinsics that do not depend on local memory at all, or that depend on smaller amounts of local memory than a hand-coded implementation would need.

On GPUs, the work-items of a work-group are executed in fixed-size batches called warps or wave-fronts, and most modern GPUs can in fact exchange data between work-items in the same warp using specific shuffle intrinsics (which have nothing to do with the OpenCL C shuffle function): these intrinsics allow work-items to access the private registers of other work-items in the same warp. While warps in the same work-group still have to communicate using local memory, a simple reduction algorithm can thus be implemented using warp shuffle instructions and only one word of local memory per warp, rather than one per work-item, which can lead to better hardware utilization (e.g. by allowing more work-groups per compute unit thanks to the reduced use of local memory).
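
To give an idea of the pattern, the following sketch is written against a hypothetical work_item_shuffle_down(value, delta) intrinsic, standing in for the hardware shuffle that OpenCL C does not expose (the function and intrinsic names here are made up); it assumes a warp size of 32 and a work-group made of whole warps, and needs only one word of local memory per warp:

    #define WARP_SIZE 32 // assumed warp/wave-front size for this sketch

    // Work-group sum built on a hypothetical shuffle-down intrinsic:
    // each warp reduces in registers, then the per-warp partial sums are
    // combined through a small local-memory scratch area.
    float work_group_sum(float x, local float *warp_sums)
    {
        const uint lid  = get_local_id(0);
        const uint lane = lid % WARP_SIZE;
        const uint warp = lid / WARP_SIZE;

        // intra-warp reduction: no local memory, no barriers
        for (uint delta = WARP_SIZE / 2; delta > 0; delta /= 2)
            x += work_item_shuffle_down(x, delta); // hypothetical intrinsic

        if (lane == 0)
            warp_sums[warp] = x; // one word of local memory per warp
        barrier(CLK_LOCAL_MEM_FENCE);

        // the first warp reduces the per-warp partial sums
        const uint num_warps = get_local_size(0) / WARP_SIZE;
        x = (lid < num_warps) ? warp_sums[lid] : 0.0f;
        if (warp == 0)
            for (uint delta = WARP_SIZE / 2; delta > 0; delta /= 2)
                x += work_item_shuffle_down(x, delta);

        return x; // final sum is valid in work-item 0
    }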

Warp shuffle instructions are available on NVIDIA GPUs with compute capability 3.0 or higher, as well as on AMD GPUs since Graphics Core Next. Additionally, vectorizing CPU platforms such as Intel's can trivially implement them in the form of vector component swizzling. Finally, all other hardware can still emulate them via local memory (which in turn might be inefficiently emulated via global memory, but still): and as inefficient as such an emulation might be, it would scarcely be worse than hand-coded use of local memory (which would remain a fall-back option available to developers).

In practice, this means that all OpenCL hardware can implement work-group shuffle instructions (some more efficiently than others), and parallel reductions of any kind could be implemented through work-group shuffles, achieving much better performance than standard local-memory reductions on hardware with native shuffle support, while being no less efficient than local-memory reductions where shuffles have to be emulated.

Conclusions

It should be obvious by now that the choice of exposing work-group reduction and scan functions, but not work-group shuffle functions, in OpenCL 2.0 results in a crippled standard:

  • it does not represent the actual capabilities of current massively parallel computational hardware, let alone the hardware we may expect in the future;
  • it effectively prevents efficient implementation of reductions and scans beyond the elementary ones (simple summation, minimum and maximum);
  • to top it all off, we can scarcely expect such high-level functions to be implemented efficiently, making them effectively useless.

The obvious solution would be to provide work-group shuffle instructions at the language level. This could in fact be a core feature, since it can be supported on all hardware, just like local memory, and the device could be queried to determine whether the instructions are supported in hardware or emulated (pretty much like devices can be queried to determine whether local memory is physical or emulated).

Optionally, it would be nice to have some introspection to allow the developer to programmatically find the warp size (i.e. the work-item concurrency granularity) used for the kernel², and potentially improve the use of these instructions by limiting the strides used in the shuffles.
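
For what it's worth, the property mentioned in the footnote below can already be queried from the host along these lines (error handling omitted; the helper function name is made up): on current GPU implementations the value it returns matches the warp/wave-front size, even though the standard does not promise that it is the dispatch granularity.

    #include <CL/cl.h>

    /* Preferred work-group size multiple for a kernel on a given device;
     * on GPUs this generally coincides with the warp/wave-front size. */
    size_t preferred_wg_multiple(cl_kernel kernel, cl_device_id device)
    {
        size_t multiple = 0;
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(multiple), &multiple, NULL);
        return multiple;
    }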


  1. since IA intrinsically depends on directed rounding, even if support for IA were provided without explicitly exposing directed rounding, it would in fact still be possible to emulate directed rounding of scalar operations by operating on interval types and then discarding the unneeded parts of the computation; of course, this would be dramatically inefficient. ↩

  2. in practice, the existing CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE kernel property that can be programmatically queried already corresponds to the warp/wave-front size on GPUs, so there might be no need for another property if it could be guaranteed that this is the work-item dispatch granularity. ↩