Introduction (history lost)
OpenCL 1.0 supported an OPENCL SELECT_ROUNDING_MODE pragma in device code, which allowed selection of the rounding mode to be used in a section of a kernel. The pragma was only available after enabling the cl_khr_select_fprounding_mode extension. Support for this extension and the related pragma(s) has been removed from subsequent versions of the standard, with the result that there is no way at all in the current OpenCL standard to have specific parts of a kernel use rounding modes different from the default, except in the explicit type conversion functions with the relevant _rt* suffix.
A consequence of this is that it is currently completely impossible to implement robust numerical code in OpenCL.
In what follows I will explore some typical use cases where directed rounding is a powerful, sometimes essential tool for numerical analysis and scientific computing. This will be followed by a short survey of existing hardware and software support for directed rounding. The article ends with a discussion about what must, and what should, be included in OpenCL to ensure it can be used as a robust scientific programming language.
Why directed rounding is important
Rationale #1: assessing numerical trustworthiness of code
In his paper How Futile are Mindless Assessments of Roundoff in Floating-Point Computation, Professor William Kahan (who helped design the IEEE 754 floating-point standard) explains that, given multiple formulas that would compute the same quantity, the fastest way to determine which formulas are numerically trustworthy is to:
Rerun each formula separately on its same input but with different directed roundings; the first one to exhibit hypersensitivity to roundoff is the first to suspect.
Further along in the same paper, Kahan adds (emphasis mine):
The goal of error-analysis is not to find errors but to fix them. They have to be found first. The embarrassing longevity, over three decades, of inaccurate and/or ugly programs to compute a function so widely used as ∠(X, Y) says something bleak about the difficulty of floating-point error-analysis for experts and non-experts: Without adequate aids like redirected roundings, diagnosis and cure are becoming practically impossible. Our failure to find errors long suspected or known to exist is too demoralizing. We may just give up.
Essential tools for the error-analysis of scientific computing code cannot be implemented in OpenCL 1.1 or later (at least up to 2.0, the latest published specification) due to the impossibility of specifying the rounding direction.
Rationale #2: enforcing numerical correctness
Directed rounding is an important tool to ensure that arguments to functions with limited domain are computed in such a way that the conditions are respected numerically when they would be analytically. To clarify, in this section I'm talking about correctly rounding the argument of a function, not its result.
When the argument to such a function is computed through an expression (particularly if such an expression is ill-conditioned) whose result is close to one of the limits of the domain, the lack of correct rounding can cause the argument to be evaluated just outside of the domain instead of just inside (which would be the analytically correct answer). This would cause the result of the function to be Not-a-Number instead of the correct(ly rounded) answer.
Common functions for which the requirements might fail to be satisfied numerically include:

- sqrt, when the argument would be a small, non-negative number; to write numerically robust code, one would want the argument to sqrt to be computed such that the final result is rounded towards plus infinity;
- inverse trigonometric functions (asin, acos, etc.), when the argument would be close to, but not greater than, 1, or close to, but not less than, -1; again, to write numerically robust code one would want the argument to be computed such that the final result is rounded towards zero.
A discussion on the importance of correct rounding can again be found in Kahan's works; see e.g. Why we needed a floating-point standard.
Robust coding of analytically correct formulas is impossible to achieve in OpenCL 1.1 or later (at least up to 2.0, the latest published specification) due to the lack of support for directed rounding.
Rationale #3: Interval Analysis
A typical example of a numerical method that requires directed rounding, with different rounding modes in different parts of the computation, is Interval Analysis (IA). Similar arguments hold for other forms of self-verified computing as well.
Briefly, in IA every (scalar) quantity q is represented by an interval whose extrema are (representable) real numbers l, u such that ‘the true value’ of q is guaranteed to satisfy l ≤ q ≤ u.
Operations on two intervals A = [al, au] and B = [bl, bu] must be conducted in such a way that the resulting interval preserves this guarantee, and this in turn means that the lower extremum must be computed in rtn (round towards negative infinity) mode, while the upper extremum must be computed in rtp (round towards positive infinity) mode.
For example, assuming add_rtn and add_rtp represent additions that round in the direction indicated by the suffix, C = A + B could be computed as:

cl = add_rtn(al, bl);
cu = add_rtp(au, bu);
In OpenCL 1.0, add_rtn and add_rtp could be defined as:
#pragma OPENCL SELECT_ROUNDING_MODE rtn
gentype add_rtn(gentype a, gentype b) {
    return a + b;
}

#pragma OPENCL SELECT_ROUNDING_MODE rtp
gentype add_rtp(gentype a, gentype b) {
    return a + b;
}

/* restore default */
#pragma OPENCL SELECT_ROUNDING_MODE rte
The same functions could be implemented in C99, in FORTRAN, in MATLAB or even in CUDA (see below). In OpenCL 1.1 and later, this is impossible to achieve, even on hardware that supports rounding mode selection.
Applicative examples
From the rationales presented so far, one could deduce that directed rounding is essentially associated with the stability and robustness of numerical code. There are however other cases where directed rounding can be used, which are not explicitly associated with things such as roundoff errors and error bound estimation.
Rounding down for the neighbors list construction in particle methods
Consider for example an industrial application of a meshless Lagrangian method such as Smoothed Particle Hydrodynamics (SPH).
In these numerical methods, the simulation domain is described by means of ‘particles’ free to move with respect to each other. The motion of these particles is typically determined by the interaction between the particle and its neighbors within a given influence sphere.
Checking for proximity between two particles is done by computing the length of the relative distance vector (differences of positions), and the same distance is often used in the actual computation of the influence between particles. As usual, to avoid bias, both the relative distance vector and its length should be computed with the default round-to-nearest-even rounding mode for normal operations.
To avoid searching for neighbors in the whole domain for every operation, implementations often keep a ‘neighbors list’ of each particle, constructed by checking the proximity of candidate particles once, and storing the indices of the particles that fall within the prescribed influence radius.
Due to the meshless nature of the method, neighborhoods may change at every timestep, requiring a rebuild of the neighbors list. To improve performance, this can be avoided by rebuilding the neighbors list at a lower frequency (e.g. every 10 timesteps), assuming (only in this phase) a larger influence radius, taking into account the maximum length that might be traveled by a particle in the given number of timesteps.
When such a strategy is adopted, neighbors need to be rechecked for actual proximity during normal operations, so that, for maximum efficiency, a delicate balance must be found between the reduced frequency and the increased number of potential neighbors caused by the enlarged influence radius.
One way to improve efficiency in this sense is to round towards zero the computation of the relative distance vector and its length during neighbors list construction: this maximizes the impact of the enlarged influence radius by including potential neighbors which are within one or two ULPs. This allows the use of very tight bounds on how much to enlarge the influence radius, without loss of correctness in the simulations.
Directed rounding support
Hardware support for directed rounding
x86-compatible CPUs have long supported selecting the rounding mode by setting the appropriate flags in the control registers (either the x87 control word for the FPU, or the MXCSR control register for SSE). Similarly, on ARM CPUs with support for the NEON or VFP instruction sets, the rounding mode can be set with the appropriate flags in the FPSCR register.
AMD GPUs also support rounding mode selection, with the granularity of an ALU clause. As documented in the corresponding reference manuals, the TeraScale 2 and TeraScale 3 architectures support setting the general rounding mode for ALU clauses via the SET_MODE instruction; Graphics Core Next (GCN) architectures can control the rounding mode by setting the appropriate bits in the MODE register via the S_SETREG instruction.
Additionally, the following hardware is capable of directed rounding at the instruction level:

- CUDA-enabled NVIDIA GPUs: as documented in the CUDA C Programming Guide, Appendix D.2 (Intrinsic Functions), some intrinsic functions can be suffixed with one of _rn, _rz, _ru, _rd to explicitly set the rounding mode of the function;
- CPUs with support for the AVX-512 instruction set: the EVEX prefix introduced with AVX-512 allows the rounding mode to be set explicitly for any given instruction, overriding the MXCSR control register, as documented in the Intel® Architecture Instruction Set Extensions Programming Reference, section 4.6.2: “Static Rounding Support in EVEX”.
Software support for directed rounding
At the software level, the processor rounding mode can be accessed in C99 and C++11 by enabling the STDC FENV_ACCESS pragma and using fesetround() (and its counterpart fegetround()).
In MATLAB, the rounding mode can be selected with the system_dependent('setround', ·) command.
Some FORTRAN implementations also offer functions to get and set the current rounding mode (e.g. IBM's XL FORTRAN offers fpgets and fpsets).
CUDA C exposes the intrinsic functions of CUDA-enabled GPUs that support explicit rounding modes. So, for example, __add_ru(a, b) (resp. __add_rd(a, b)) can be used in CUDA C to obtain the sum of a and b rounded up (resp. down) without having to change the rounding mode of the whole GPU.
Even the GNU implementation of the text-processing language Awk has a method to set the rounding mode for floating-point operations, via the ROUNDMODE variable.
All in all, OpenCL (from 1.1 onwards) seems to be the only language/API that does not support directed rounding.
What can be done for OpenCL
In their present state, OpenCL 1.1 through 2.0 lag behind C99, C++11, FORTRAN, MATLAB and CUDA (at the very least) by lacking support for directed rounding. This effectively prevents robust numerical code from being implemented and analyzed in OpenCL.
While I can understand that core support for directed rounding in OpenCL is a bit of a stretch, considering the wide range of hardware that supports the specification, I believe that the standard should provide an official extension to (re)introduce support for it. This could be done by reinstating the cl_khr_select_fprounding_mode extension, or through a different extension with better semantics (for example, modelled around the C99/C++11 STDC FENV_ACCESS pragma).
This is the minimum requirement to bring OpenCL C on par with C and C++ as a language for scientific computing.
Ideally (potentially through a different extension), it would be nice to also have explicit support for instruction-level rounding mode selection, independently of the current rounding mode, with intrinsics similar to the ones that OpenCL already defines for the conversion functions. On supporting hardware, this would make it possible to implement even more efficient, yet still robust, numerical code needing different rounding modes for separate subexpressions.
Granularity
When it comes to the OpenCL programming model, it's important to specify the scope of application of state changes, of which the rounding mode is one. Given the use cases discussed above, we could say that the minimum requirement would be for OpenCL to support changing the rounding mode during kernel execution, uniformly across the whole launch grid, to a value known at (kernel) compile time.
So, it should be possible (when the appropriate extension is supported and enabled) to change rounding mode halfway through a kernel, along these lines:
kernel void some_kern(...) {
    /* kernels start in some default rounding mode,
     * e.g. round-to-nearest-even. We do some calculations
     * in this default mode:
     */
    do_something;
    /* now we change to some other mode; of course this
     * is just pseudo-syntax:
     */
    change_rounding_mode(towards_zero);
    /* and now we do more calculations, this time
     * with the new rounding mode enabled:
     */
    do_something_else;
}
The minimum supported granularity would thus be the whole launch grid, as long as the rounding mode can be changed dynamically during kernel execution, to (any) value known at compile time.
Of course, a finer granularity and a more relaxed (i.e. runtime) selection of the rounding mode would be interesting additional features. These may be made optional, and the hardware capability in this regard could be queried through appropriate device properties.
For example, considering the standard execution model for OpenCL, with work-groups mapped to compute units, it might make sense to support granularity at the work-group level. This would be a nice addition, since it would allow e.g. running the same code concurrently with different rounding modes (one per work-group), which would benefit applications geared towards the analysis of the stability of numerical code (as discussed in Rationale #1). But it's not strictly necessary.