Rounding modes in OpenCL

Rounding modes in OpenCL

Introduction (history lost)

OpenCL 1.0 supported an OPENCL SELECT_ROUNDING_MODE pragma in device code, which allowed selection of the rounding mode to be used in a section of a kernel. The pragma was only available after enabling the cl_khr_select_fprounding_mode extension. Support for this extension and the relative pragma(s) has been removed from subsequent version of the standard, with the result that there is no way at all in the current OpenCL standard to have specific parts of a kernel use rounding modes different from the default, except in the explicit type conversion functions with the relevant _rt* suffix.

A consequence of this is that it is currently completely impossible to implement robust numerical code in OpenCL.

In what follows I will explore some typical use cases where directed rounding is a powerful, sometimes essential tool for numerical analysis and scientific computing. This will be followed by a short survey of existing hardware and software support for directed rounding. The article ends with a discussion about what must, and what should, be included in OpenCL to ensure it can be used as a robust scientific programming language.

Why directed rounding is important

Rationale #1: assessing numerical trustworthiness of code

In his paper How Futile are Mindless Assessments of Roundoff in Floating-Point Computation, professor William Kahan (who helped design the IEEE-754 floating-point standard) explains that, given multiple formulas that would compute the same quantity, the fastest way to determine which formulas are numerically trustworthy is to:

Rerun each formula separately on its same input but with different directed roundings; the first one to exhibit hypersensitivity to roundoff is the first to suspect.

Further along in the same paper, Kahan adds (emphasis mine):

The goal of error-analysis is not to find errors but to fix them. They have to be found first. The embarrassing longevity, over three decades, of inaccurate and/ or ugly programs to compute a function so widely used as ∠(X, Y) says something bleak about the difficulty of floating-point error-analysis for experts and nonexperts: Without adequate aids like redirected roundings, diagnosis and cure are becoming practically impossible. Our failure to find errors long suspected or known to exist is too demoralizing. We may just give up.

Essential tools for the error-analysis of scientific computing code cannot be implemented in OpenCL 1.1 or later (at least up to 2.0, the latest published specification) due to the impossibility of specifying the rounding direction.

Rationale #2: enforcing numerical correctness

Directed rounding is an important tool to ensure that arguments to functions with limited domain are computed in such a way that the conditions are respected numerically when they would be analytically. To clarify, in this section I'm talking about correctly rounding the argument of a function, not its result.

When the argument to such a function is computed through an expression (particularly if such an expression is ill-conditioned) whose result is close to one of the limits of the domain, the lack of correct rounding can cause the argument to be evaluated just outside of the domain instead of just inside (which would be the analytically correct answer). This would cause the result of the function to be Not-a-Number instead of the correct(ly rounded) answer.

Common functions for which the requirements might fail to be satisfied numerically include:

sqrt: when the argument would be a small, non-negative number; to write numerically robust code one would want the argument to sqrt be computed such that the final result is towards plus infinity;
inverse trigonometric functions (asin, acos, etc): when the argument would be close to, but not greater than 1, or close to, but not less than -1; again, to write numerically robust code one would want the argument to be computed such that the final result is rounded towards zero.

A discussion on the importance of correct rounding can again be found in Kahan's works, see e.g. Why we needed a floating-point standard.

Robust coding of analytically correct formulas is impossible to achieve in OpenCL 1.1 or later (at least up to 2.0, the latest published specification) due to the lack of support for directed rounding.

Rationale #3: Interval Analysis

A typical example of a numerical method for which support for directed rounding rounding modes in different parts of the computation is needed is Interval Analysis (IA). Similar arguments hold for other forms of self-verified computing as well.

Briefly, in IA every (scalar) quantity q is represented by an interval whose extrema are (representable) real numbers l, u such that ‘the true value’ of q is guaranteed to satisfy l ≤ q ≤ u.

Operations on two intervals A = [al, au] and B = [bl, bu] must be conducted in such a way that the resulting interval can preserve this guarantee, and this in turn means that the lower extremum must be computed in rtn (round towards negative infinity) mode, while the upper extremum must be computed in rtp (round towards positive infinity) mode.

For example, assuming add_rtn and add_rtp represent additions that rounds in the suffix direction, we have that C = A + B could be computed as:

cl = add_rtn(al, bl);
cu = add_rtp(au, bu);

In OpenCL 1.0, add_rtn and add_rtp could be defined as:

#pragma OPENCL SELECT_ROUNDING_MODE rtn
gentype add_rtn(gentype a, gentype b) {
 return a + b;
}
#pragma OPENCL SELECT_ROUNDING_MODE rtp
gentype add_rtp(gentype a, gentype b) {
 return a + b;
}
/* restore default */
#pragma OPENCL SELECT_ROUNDING_MODE rte

The same functions could be implemented in C99, in FORTRAN, in MATLAB or even in CUDA (see below). In OpenCL 1.1 and later, this is impossible to achieve, even on hardware that supports rounding mode selection.

Applicative examples

From the rationales presented so far, one could deduce that directed rounding is essentially associated with the stability and robustness of numerical code. There are however other cases where directed rounding can be used, which are not explicitly associated with things such as roundoff errors and error bound estimation.

Rounding down for the neighbors list construction in particle methods

Consider for example an industrial application of mesh-less Lagrangian such as Smoothed Particle Hydrodynamics (SPH).

In these numerical methods, the simulation domain is described by means of ‘particles’ free to move with respect to each other. The motion of these particles is typically determined by the interaction between the particle and its neighbors within a given influence sphere.

Checking for proximity between two particles is done by computing the length of the relative distance vector (differences of positions), and the same distance is often used in the actual computation of the influence between particles. As usual, to avoid bias, both the relative distance vector and its length should be computed with the default round-to-nearest-even rounding mode for normal operations.

To avoid searching for neighbors in the whole domain for every operation, implementations often keep a ‘neighbors list’ of each particle, constructed by checking the proximity of candidate particles once, and storing the indices of the particles that fall within the prescribed influence radius.

Due to the mesh-less nature of the method, neighborhoods may change at every time-step, requiring a rebuild of the neighbors list. To improve performance, this can be avoided by rebuilding the neighbors list at a lower frequency (e.g. every 10 time-steps), assuming (only in this phase) a larger influence radius, taking into account the maximum length that might be traveled by a particle in the given number of time-steps.

When such a strategy is adopted, neighbors need to be re-checked for actual proximity during normal operations, so that, for maximum efficiency, a delicate balance must be found between the reduced frequency and the increased number of potential neighbors caused by the enlarged influence radius.

One way to improve efficiency in this sense is to round towards zero the computation of the relative distance vector and its length during neighbors list construction: this maximizes the impact of the enlarged influence radius by including potential neighbors which are within one or two ULPs. This allows the use of very tight bounds on how much to enlarge the influence radius, without loss of correctness in the simulations.

Directed rounding support

Hardware support for directed rounding

x86-compatible CPUs have had support for setting the rounding mode by setting the appropriate flags in the control registers (either the x87 control word for FPU, or the MXCSR control register for SSE). Similarly, on ARM CPUs with support for the NEON or VFP instruction set, the rounding mode can be set with appropriate flags in the FPCSR

AMD GPUs also have support for rounding modes selection, with the granularity of an ALU clause. As documented in the corresponding reference manuals, the TeraScale 2 and TeraScale 3 architectures support setting the general rounding mode for ALU clauses via the SET_MODE instruction; Graphics Core Next (GCN) architectures can control the rounding mode by setting the appropriate bits in the MODE register via the S_SETREG instruction.

Additionally, the following hardware is capable of directed rounding at the instruction level:

CUDA-enabled NVIDIA GPUs: as documented in the CUDA C Programming Guide, Appendix D.2 (Intrinsic Functions), some intrinsic functions can be suffixed with one of _rn,_rz,_ru,_rd to explicitly set the rounding mode of the function;
CPUs with support for the AVX-512 instruction set: the EVEX prefix introduced with AVX-512 supports the rounding mode to be set explicitly for any given instruction, overriding the MXCSR control register, as documented in the Intel® Architecture Instruction Set Extensions Programming Reference, section 4.6.2: “Static Rounding Support in EVEX”.

Software support for directed rounding

At the software level, support for the rounding mode at the processor level can be accessed in C99 and C++11 by enabling the STDC FENV_ACCESS pragma and using fesetenv() (and its counterpart fegetenv()).

In MATLAB, the rounding mode can be selected by the system_dependent('setround', ·) command.

Some FORTRAN implementations also offer functions to get and set the current rounding mode (e.g. IBM's XL FORTRAN offers fpgets and fpsets).

CUDA C exposes the intrinsic functions of CUDA-enabled GPUs that support explicit rounding modes. So, for example, __add_ru(a, b) (resp. __add_rd(a, b)) can be used in CUDA C to obtain the sum of a and b rounded up (resp. down) without having to change the rounding mode of the whole GPU.

Even the GNU implementation of the text-processing language Awk has a method to set the rounding mode in floating-point operations, via the ROUNDMODE variable.

All in all, OpenCL (since 1.1 on) seems to be the only language/API to not support directed rounding.

What can be done for OpenCL

In its present state, OpenCL 1.1 to 2.0 are lagging behind C99, C++11, FORTRAN, MATLAB and CUDA (at the very least) by lacking support for directed rounding. This effectively prevents robust numerical code to be implemented and analyzed in OpenCL.

While I can understand that core support for directed rounding in OpenCL is a bit of a stretch, considering the wide range of hardware that support the specification, I believe that the standard should provide an official extension to (re)introduce support for it. This could be done by re-instating the cl_khr_select_fprounding_mode extension, or through a different extension with better semantics (for example, modelled around the C99/C++11 STDC FENV_ACCESS pragma).

This is the minimum requirement to bring OpenCL C on par with C and C++ as a language for scientific computing.

Ideally (potentially through a different extension), it would be nice to also have explicit support for instruction-level rounding mode selection independently from the current rounding mode, with intrinsics similar to the ones that OpenCL defines already for the conversion functions. On supporting hardware, this would make it possible to implement even more efficient, yet still robust numerical code needing different rounding modes for separate subexpression.

Granularity

When it comes to the OpenCL programming model, it's important to specify the scope of application of state changes, of which the rounding mode is one. Given the use cases discussed above, we could say that the minimum requirement would be for OpenCL to support changing the rounding mode during kernel execution, and for the whole launch grid to a value known at (kernel) compile time.

So, it should be possible (when the appropriate extension is supported and enabled) to change rounding mode half-way through a kernel. The new:

kernel some_kern(...) {
    /* kernels start in some default rounding mode,
     * e.g. round-to-nearest-even. We do some calculations
     * in this default mode:
     */
    do_something ;

    /* now we change to some other mode. of course this
     * is just pseudo-syntax:
     */
    change_rounding_mode(towards_zero);

    /* and now we do more calculations, this time
     * with the new rounding mode enabled:
     */
    do_something_else;
}

The minimum supported granularity would thus be the whole launch grid, as long as the rounding mode can be changed dynamically during kernel execution, to (any) value known at compile time.

Of course, a finer granularity and a more relaxed (i.e. runtime) selection of the rounding mode would be interesting additional features. These may be made optional, and the hardware capability in this regard could be queried through appropriate device properties.

For example, considering the standard execution model for OpenCL, with work-groups mapped to compute units, it might make sense to support a granularity at the work-group level. This would be a nice addition, since it would allow e.g. to concurrently run the same code with different rounding modes (one per work-group), which would benefit applications geared towards the analysis of the stability of numerical code (as discussed in Rationale #1). But it's not strictly necessary.