pages tagged numericalwokhttp://wok.oblomov.eu/tag/numerical/wok's tag/numericalikiwiki2015-06-22T20:27:27ZRounding modes in OpenCLhttp://wok.oblomov.eu/tecnologia/gpgpu/opencl-rounding-modes/Oblomov2015-06-22T20:27:27Z2014-06-17T15:16:00Z
<h2 id="introductionhistorylost">Introduction (history lost)</h2>
<p>OpenCL 1.0 supported an <code>OPENCL SELECT_ROUNDING_MODE</code> pragma in device
code, which allowed selection of the rounding mode to be used in a
section of a kernel. The pragma was only available after enabling the
<code>cl_khr_select_fprounding_mode</code> extension. Support for this extension
and the relative pragma(s) has been removed from subsequent version of
the standard, with the result that there is <em>no way at all</em> in the
current OpenCL standard to have specific parts of a kernel use rounding
modes different from the default, except in the explicit type conversion
functions with the relevant <code>_rt*</code> suffix.</p>
<p>A consequence of this is that it is currently completely impossible to
implement robust numerical code in OpenCL.</p>
<p>In what follows I will explore some typical use cases where directed
rounding is a powerful, sometimes essential tool for numerical analysis
and scientific computing. This will be followed by a short survey of
existing hardware and software support for directed rounding. The
article ends with a discussion about what must, and what should, be
included in OpenCL to ensure it can be used as a robust scientific
programming language.</p>
<h2 id="whydirectedroundingisimportant">Why directed rounding is important</h2>
<h3 id="codeanalysis">Rationale #1: assessing numerical trustworthiness of code</h3>
<p>In his paper <a href="http://www.cs.berkeley.edu/~wkahan/Mindless.pdf">How Futile are Mindless Assessments of Roundoff in
Floating-Point Computation</a>, professor <a href="http://en.wikipedia.org/wiki/William%20Kahan">William Kahan</a>
(who helped design the IEEE-754 floating-point standard) explains that,
given multiple formulas that would compute the same quantity, the
fastest way to determine which formulas are numerically trustworthy is
to:</p>
<blockquote>
<p>Rerun each formula separately on its same input but with different
directed roundings; the first one to exhibit hypersensitivity to
roundoff is the first to suspect.</p>
</blockquote>
<p>Further along in the same paper, Kahan adds (emphasis mine):</p>
<blockquote>
<p>The goal of error-analysis is not to find errors but to fix them.
They have to be found first. The embarrassing longevity, over three
decades, of inaccurate and/ or ugly programs to compute a function so
widely used as ∠(X, Y) says something bleak about the difficulty of
floating-point error-analysis for experts and nonexperts: Without
adequate aids like <strong>redirected roundings</strong>, diagnosis and cure are
becoming practically impossible. Our failure to find errors long
suspected or known to exist is too demoralizing. We may just give up.</p>
</blockquote>
<p>Essential tools for the error-analysis of scientific computing code
cannot be implemented in OpenCL 1.1 or later (at least up to 2.0, the
latest published specification) due to the impossibility of specifying
the rounding direction.</p>
<h3 id="rationale2:enforcingnumericalcorrectness">Rationale #2: enforcing numerical correctness</h3>
<p>Directed rounding is an important tool to ensure that <em>arguments</em> to
functions with limited domain are computed in such a way that the
conditions are respected numerically when they would be analytically. To
clarify, in this section I'm talking about correctly rounding the
<em>argument</em> of a function, not its result.</p>
<p>When the argument to such a function is computed through an expression
(particularly if such an expression is ill-conditioned) whose result is
close to one of the limits of the domain, the lack of <em>correct rounding</em>
can cause the argument to be evaluated <em>just outside</em> of the domain
instead of <em>just inside</em> (which would be the analytically correct
answer). This would cause the result of the function to be Not-a-Number
instead of the correct(ly rounded) answer.</p>
<p>Common functions for which the requirements might fail to be satisfied
numerically include:</p>
<dl>
<dt><code>sqrt</code></dt>
<dd>
<p>when the <em>argument</em> would be a small, non-negative number; to
write numerically robust code one would want the <em>argument</em> to <code>sqrt</code> be
computed such that the final result is <em>towards plus infinity</em>;</p>
</dd>
<dt>inverse trigonometric functions (<code>asin</code>, <code>acos</code>, etc)</dt>
<dd>
<p>when the argument would be close to, but not greater than 1, or
close to, but not less than -1; again, to write numerically robust code
one would want the argument to be computed such that the final result
is rounded <em>towards zero</em>.</p>
</dd>
</dl>
<p>A discussion on the importance of correct rounding can again be found in
Kahan's works, see e.g. <a href="http://www.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf">Why we needed a floating-point standard</a>.</p>
<p>Robust coding of analytically correct formulas is impossible to achieve
in OpenCL 1.1 or later (at least up to 2.0, the latest published
specification) due to the lack of support for directed rounding.</p>
<h3 id="rationale3:intervalanalysis">Rationale #3: Interval Analysis</h3>
<p>A typical example of a numerical method for which support for directed
rounding rounding modes in different parts of the computation is needed
is <a href="http://en.wikipedia.org/wiki/Interval%20Analysis">Interval Analysis</a> (IA). Similar arguments hold for other
forms of self-verified computing as well.</p>
<p>Briefly, in IA every (scalar) quantity <em>q</em> is represented by an
<strong>interval</strong> whose extrema are (representable) real numbers <em>l, u</em> such
that ‘the true value’ of <em>q</em> is <strong>guaranteed</strong> to satisfy <em>l ≤ q ≤ u</em>.</p>
<p>Operations on two intervals <em>A = [al, au]</em> and <em>B = [bl, bu]</em> must be
conducted in such a way that the resulting interval can preserve this
guarantee, and this in turn means that the lower extremum <em>must</em> be
computed in <code>rtn</code> (round towards negative infinity) mode, while the
upper extremum <em>must</em> be computed in <code>rtp</code> (round towards positive
infinity) mode.</p>
<p>For example, assuming <code>add_rtn</code> and <code>add_rtp</code> represent additions that
rounds in the suffix direction, we have that <em>C = A + B</em> could be
computed as:</p>
<pre><code>cl = add_rtn(al, bl);
cu = add_rtp(au, bu);</code></pre>
<p>In OpenCL 1.0, <code>add_rtn</code> and <code>add_rtp</code> could be defined as:</p>
<pre><code>#pragma OPENCL SELECT_ROUNDING_MODE rtn
gentype add_rtn(gentype a, gentype b) {
return a + b;
}
#pragma OPENCL SELECT_ROUNDING_MODE rtp
gentype add_rtp(gentype a, gentype b) {
return a + b;
}
/* restore default */
#pragma OPENCL SELECT_ROUNDING_MODE rte</code></pre>
<p>The same functions could be implemented in C99, in FORTRAN, in MATLAB or
even in CUDA (see below). In OpenCL 1.1 and later, this is impossible to
achieve, even on hardware that supports rounding mode selection.</p>
<h2 id="applicativeexamples">Applicative examples</h2>
<p>From the rationales presented so far, one could deduce that directed rounding
is essentially associated with the stability and robustness of numerical
code. There are however other cases where directed rounding can be used,
which are not explicitly associated with things such as roundoff errors and
error bound estimation.</p>
<h3 id="roundingdownfortheneighborslistconstructioninparticlemethods">Rounding down for the neighbors list construction in particle methods</h3>
<p>Consider for example an industrial application of mesh-less Lagrangian
such as Smoothed Particle Hydrodynamics (SPH).</p>
<p>In these numerical methods, the simulation domain is described by means of
‘particles’ free to move with respect to each other. The motion of these
particles is typically determined by the interaction between the particle and
its neighbors within a given influence sphere.</p>
<p>Checking for proximity between two particles is done by computing the length
of the relative distance vector (differences of positions), and the same
distance is often used in the actual computation of the influence between
particles. As usual, to avoid bias, both the relative distance vector and
its length should be computed with the default round-to-nearest-even
rounding mode for normal operations.</p>
<p>To avoid searching for neighbors in the whole domain for every operation,
implementations often keep a ‘neighbors list’ of each particle, constructed
by checking the proximity of candidate particles once, and storing the
indices of the particles that fall within the prescribed influence radius.</p>
<p>Due to the mesh-less nature of the method, neighborhoods may change at every
time-step, requiring a rebuild of the neighbors list. To improve
performance, this can be avoided by rebuilding the neighbors list at a lower
frequency (e.g. every 10 time-steps), assuming (only in this phase) a larger
influence radius, taking into account the maximum length that might be
traveled by a particle in the given number of time-steps.</p>
<p>When such a strategy is adopted, neighbors need to be re-checked for
actual proximity during normal operations, so that, for maximum efficiency, a
delicate balance must be found between the reduced frequency and the
increased number of potential neighbors caused by the enlarged influence
radius.</p>
<p>One way to improve efficiency in this sense is to <em>round towards zero</em> the
computation of the relative distance vector and its length during neighbors
list construction: this maximizes the impact of the enlarged influence
radius by including potential neighbors which are within one or two ULPs.
This allows the use of very tight bounds on how much to enlarge the
influence radius, without loss of correctness in the simulations.</p>
<h2 id="directedroundingsupport">Directed rounding support</h2>
<h3 id="hardwaresupportfordirectedrounding">Hardware support for directed rounding</h3>
<p>x86-compatible CPUs have had support for setting the rounding mode by setting
the appropriate flags in the control registers (either the x87 control word for
FPU, or the MXCSR control register for SSE). Similarly, on ARM CPUs with
support for the NEON or VFP instruction set, the rounding mode can be set with
appropriate flags in the FPCSR</p>
<p>AMD GPUs also have support for rounding modes selection, with the granularity
of an ALU clause. As documented in the <a href="http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/#isa">corresponding reference
manuals</a>, the TeraScale 2 and TeraScale 3 architectures support
setting the general rounding mode for ALU clauses via the <code>SET_MODE</code>
instruction; Graphics Core Next (GCN) architectures can control the rounding
mode by setting the appropriate bits in the <code>MODE</code> register via the
<code>S_SETREG</code> instruction.</p>
<p>Additionally, the following hardware is capable of directed rounding at
the instruction level:</p>
<dl>
<dt>CUDA-enabled NVIDIA GPUs</dt>
<dd>
<p>as documented in the <a href="http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#intrinsic-functions">CUDA C Programming Guide, Appendix D.2
(Intrinsic Functions)</a>, some intrinsic functions can be suffixed with one of
<code>_rn</code>,<code>_rz</code>,<code>_ru</code>,<code>_rd</code> to explicitly set the rounding mode of the function;</p>
</dd>
<dt>CPUs with support for the AVX-512 instruction set</dt>
<dd>
<p>the <code>EVEX</code> prefix introduced with AVX-512 supports the rounding mode
to be set explicitly for any given instruction, overriding the MXCSR control
register, as documented in <a href="http://software.intel.com/en-us/file/319433-017pdf">the Intel® Architecture Instruction Set Extensions
Programming Reference</a>, section 4.6.2: “Static Rounding Support in
EVEX”.</p>
</dd>
</dl>
<h3 id="softwaresupportfordirectedrounding">Software support for directed rounding</h3>
<p>At the software level, support for the rounding mode at the processor
level can be accessed in C99 and C++11 by enabling the <code>STDC
FENV_ACCESS</code> pragma and using <code>fesetenv()</code> (and its counterpart
<code>fegetenv()</code>).</p>
<p>In MATLAB, the rounding mode can be selected by the
<code>system_dependent('setround', ·)</code> command.</p>
<p>Some FORTRAN implementations also offer functions to get and set the
current rounding mode (e.g. IBM's XL FORTRAN offers <code>fpgets</code> and
<code>fpsets</code>).</p>
<p>CUDA C exposes the intrinsic functions of CUDA-enabled GPUs that support
explicit rounding modes. So, for example, <code>__add_ru(a, b)</code> (<em>resp.</em>
<code>__add_rd(a, b)</code>) can be used in CUDA C to obtain the sum of <code>a</code> and <code>b</code>
rounded up (<em>resp.</em> down) without having to change the rounding mode of
the whole GPU.</p>
<p>Even the GNU implementation of the text-processing language Awk has a method
to set the rounding mode in floating-point operations, <a href="https://www.gnu.org/software/gawk/manual/html_node/Setting-Rounding-Mode.html">via the <code>ROUNDMODE</code>
variable</a>.</p>
<p>All in all, OpenCL (since 1.1 on) seems to be the only language/API to
<em>not</em> support directed rounding.</p>
<h2 id="whatcanbedoneforopencl">What can be done for OpenCL</h2>
<p>In its present state, OpenCL 1.1 to 2.0 are lagging behind C99, C++11,
FORTRAN, MATLAB and CUDA (at the very least) by lacking support for
directed rounding. This effectively prevents robust numerical code to be
implemented and analyzed in OpenCL.</p>
<p>While I can understand that <em>core</em> support for directed rounding in
OpenCL is a bit of a stretch, considering the wide range of hardware that
support the specification, I believe that the standard should provide an
official extension to (re)introduce support for it. This could be done
by re-instating the <code>cl_khr_select_fprounding_mode</code> extension, or
through a different extension with better semantics (for example,
modelled around the C99/C++11 <code>STDC FENV_ACCESS</code> pragma).</p>
<p>This is the minimum requirement to bring OpenCL C on par with C and C++
as a language for scientific computing.</p>
<p>Ideally (potentially through a different extension), it would be nice to
also have explicit support for instruction-level rounding mode selection
independently from the current rounding mode, with intrinsics similar to
the ones that OpenCL defines already for the conversion functions.
On supporting hardware,
this would make it possible to implement even more efficient, yet still
robust numerical code needing different rounding modes for separate
subexpression.</p>
<h3 id="granularity">Granularity</h3>
<p>When it comes to the OpenCL programming model, it's important to specify
the scope of application of state changes, of which the rounding mode is
one. Given the use cases discussed above, we could say that the minimum
requirement would be for OpenCL to support changing the rounding mode
<em>during kernel execution</em>, and <em>for the whole launch grid</em> to a value
known <em>at (kernel) compile time</em>.</p>
<p>So, it should be possible (when the appropriate extension is supported
and enabled) to change rounding mode half-way through a kernel. The new:</p>
<pre><code>kernel some_kern(...) {
/* kernels start in some default rounding mode,
* e.g. round-to-nearest-even. We do some calculations
* in this default mode:
*/
do_something ;
/* now we change to some other mode. of course this
* is just pseudo-syntax:
*/
change_rounding_mode(towards_zero);
/* and now we do more calculations, this time
* with the new rounding mode enabled:
*/
do_something_else;
}</code></pre>
<p>The minimum supported granularity would thus be the whole launch grid,
as long as the rounding mode can be changed dynamically during kernel
execution, to (any) value known at compile time.</p>
<p>Of course, a finer granularity and a more relaxed (i.e. runtime)
selection of the rounding mode would be interesting additional features.
These may be made optional, and the hardware capability in this regard
could be queried through appropriate device properties.</p>
<p>For example, considering the standard execution model for OpenCL, with
work-groups mapped to compute units, it might make sense to support a
granularity at the work-group level. This would be a nice addition,
since it would allow e.g. to <em>concurrently</em> run the same code with
different rounding modes (one per work-group), which would benefit
applications geared towards the analysis of the stability of numerical
code (as discussed in <a href="http://wok.oblomov.eu/tag/numerical/#codeanalysis">Rationale #1</a>). But it's not
strictly necessary.</p>