GPGPU is essentially composed of two layers: the hardware on which it runs, and the software tools used to control and program that hardware. Although the two are often tightly integrated, it is a good idea to consider them separately when evaluating their quality and looking for the best options.

In this article I will look at the major players (at the time of writing) in both the hardware and software fields, with a comparison of some of the key upsides and downsides of each.

Hardware

The two major players in the GPGPU market are NVIDIA and AMD, the latter since its acquisition of ATI. More recently, Intel has also started to gain a foothold in the market, with its latest HD4000 integrated GPU series sporting OpenCL support (on Windows). I will focus my comparison on the products of the first two companies, for two reasons: I don't have direct experience with the Intel GPUs (yet), and the current operating system limitation effectively takes Intel out of the general-purpose computing scene.

NVIDIA (CUDA)

NVIDIA was, in a sense, the first company to go public with consumer-grade GPGPU products and support (the G80 GeForce series, in late 2006). The technology behind its GPGPU support is called CUDA, initially an acronym for Compute Unified Device Architecture (although NVIDIA itself has since stopped acknowledging the expansion, sticking simply to CUDA). Its foundation is a unified shader architecture coupled with the ability to execute kernels independently of the standard rendering pipeline.

Since the initial release of CUDA devices, multiple hardware generations have been put into production, with different capabilities. These generations are programmatically exposed in terms of the ‘CUDA capability’ of the device, a versioning scheme where the major number denotes the generation and the minor number denotes variations within it: for example, devices with CUDA capability 1.0, 1.1, 1.2 and 1.3 are all first-generation, devices with CUDA capability 2.0 and 2.1 are second-generation, and so on.
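As an aside, the capability of the devices installed in a system can be queried from the host; here is a minimal sketch using the CUDA runtime API (error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major/prop.minor encode the generation and its stepping,
        // e.g. 1.3 for a first-generation device, 2.1 for a second-generation one
        printf("Device %d: %s, capability %d.%d, %d multiprocessors\n",
               dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}
```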

New CUDA generations introduce significant overhauls of the internal design, and usually require code changes to take full advantage of the new capabilities. Minor steppings within the same generation can still bring significant benefits, while preserving the same overall hardware design and without particular effort on the programmer's side. An example of this was the move from 1.0 and 1.1 to 1.2 and 1.3, where the improved memory coalescing rules brought a large performance boost without any need for code redesign, as opposed to the move from 1.x to 2.x, where the introduction of the L1 cache required code rewrites to take full advantage of the new capability.
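To give an idea of what the coalescing rules are about, the following hypothetical kernels sketch the two extreme access patterns: in the first, consecutive threads of a warp touch consecutive words, which the hardware can serve with a handful of memory transactions; in the second, the stride scatters the accesses of a warp across memory, multiplying the number of transactions needed.

```cuda
// Coalesced access: thread i of a warp reads word i, so a warp covers a
// contiguous, aligned segment of memory.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided access: the threads of a warp read words that are 'stride'
// elements apart, defeating coalescing.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```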

CUDA hardware was born with a rather simple design. Each card is equipped with one or more multiprocessors (MP), each composed of a number of (scalar) compute cores. Although NVIDIA (for marketing reasons) has always advertised the number of CUDA cores present in its cards (a number that has recently reached the thousands), the number of multiprocessors is often a much more important factor in code performance, except possibly for the simplest kernels, as we shall see further on.

Each multiprocessor also holds a certain amount of (scalar, 4-byte wide) registers, typically on the order of tens of thousands; although the register file exists at the multiprocessor level, registers are assigned privately to each core, so that during execution cores cannot read from or write to variables stored in the registers assigned to a different core (at least until the introduction of third-generation hardware, which comes equipped with limited support for this functionality).

Multiprocessors also have a certain amount of local data share (LDS, called shared memory in CUDA lingo) which is concurrently accessible, for both reading and writing, by all cores on the same multiprocessor. In second-generation architectures and later, the same on-chip memory also acts as a (mostly hardware-controlled) L1 cache for global memory.
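As a small, hypothetical sketch of how the LDS is used from a kernel: the threads of a block stage a tile of data in shared memory, synchronize, and then read elements written by other threads of the same block, without going back to global memory (for simplicity, the sketch assumes the problem size is a multiple of the block size).

```cuda
// Each block loads a tile into shared memory and writes it out reversed;
// the mirrored read means each thread consumes a value stored by a
// *different* thread of its block, which is exactly what the LDS is for.
__global__ void reverse_tiles(const float *in, float *out) {
    extern __shared__ float tile[];   // size set at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                  // whole tile now visible to the block
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```

Such a kernel would be launched as e.g. `reverse_tiles<<<blocks, threads, threads*sizeof(float)>>>(d_in, d_out)`, with the third launch parameter specifying how much shared memory to reserve per block.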

To understand the importance of the multiprocessor's role, let us look at the execution model for (compute) shaders in CUDA, focusing initially on first-generation hardware.

In this case (hardware with CUDA capability 1.x) each MP holds 8 compute cores that share the same instruction pointer: at every step, all cores in a single multiprocessor therefore execute the same instruction, potentially on different data, in typical SIMD fashion. The actual SIMD width (called the warp size) in CUDA is 32, meaning that each core actually executes the same instruction 4 times, once for each of 4 different threads.

The situation is only marginally different for later generations. For example, a second-generation MP has 32 (CUDA capability 2.0) or 48 cores (CUDA capability 2.1), but each warp is executed by going twice over a set of 16 cores, so that a single MP can run 2 warps concurrently1.

Despite NVIDIA's efforts to emphasize the ability of its hardware to run multiple ‘threads’ concurrently (one for each CUDA core), it is therefore extremely important to focus on the fundamental logical execution unit, which is the warp, executed by groups of compute cores. Of course, when all ‘threads’ run the same instruction you get performance roughly proportional to the number of cores; but in anything more complex than that (for example, anything that involves reductions), what really matters is the number of warps that can be executed concurrently by the card, i.e. the number of MPs times the number of concurrent warps per MP.
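The difference can be seen in a sketch like the following (hypothetical kernels): both perform the same amount of work, but in the first the condition splits every warp in half, so the hardware must serialize the two paths (divergence), while in the second the condition is uniform within each warp of 32 threads and every warp runs a single path at full speed.

```cuda
// Divergent: even and odd lanes of the *same* warp take different paths.
__global__ void divergent(float *data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) data[tid] *= 2.0f;
    else              data[tid] += 1.0f;
}

// Warp-uniform: whole warps take one path or the other, no serialization.
__global__ void warp_uniform(float *data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / 32) % 2 == 0) data[tid] *= 2.0f;
    else                     data[tid] += 1.0f;
}
```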

Another important aspect of GPU performance is ‘overcommitting’: if an MP could only manage a single warp, it would stall, doing nothing, whenever that warp waited for data from global memory; but the MPs in CUDA cards are designed to ‘hold’ multiple resident warps, so that they can switch away from stalled warps to serve warps that are ready to run.

The total number of warps that can be resident on an MP is limited by a number of factors: a hardware limit, but also the amount of resources (registers, shared memory) consumed by each warp with respect to the total amount available on the MP. Once again, the MP becomes an important limiting factor.

For example, from the second to the third generation of CUDA cards there has been a huge increase in the number of cores (192 cores per MP versus 32 or 48, i.e. a factor of 4 to 6), but the available resources haven't grown accordingly (registers went from 32K to 64K, the maximum number of resident warps went from 48 to 64, and the amount of on-chip memory shared between LDS and L1 cache has remained constant, with at most 48K for the LDS), and the number of warps that can be run concurrently has only doubled, 4 warps versus the 2 warps of the second generation2. This makes overcommitting harder, with a consequently higher risk of having the MP stall: even though warp execution is much faster, it becomes harder to exploit the full computational power of the device in non-trivial scenarios.
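As a back-of-the-envelope sketch of how resource consumption limits residency (ignoring allocation granularities and the shared-memory limit, and using the third-generation figures quoted above; the per-thread register count is whatever the compiler reports for a given kernel, e.g. via nvcc -Xptxas -v):

```cuda
#include <cstdio>
#include <algorithm>

int main() {
    const int regfile         = 64 * 1024;  // registers per MP (third generation)
    const int hw_max_warps    = 64;         // hardware limit on resident warps
    const int warp_size       = 32;
    const int regs_per_thread = 40;         // example figure for some kernel

    // warps that fit in the register file, capped by the hardware limit
    int warps_by_regs = regfile / (regs_per_thread * warp_size);
    int resident      = std::min(hw_max_warps, warps_by_regs);
    printf("resident warps per MP: %d (register-limited: %d)\n",
           resident, warps_by_regs);
    return 0;
}
```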

On the other hand, third-generation hardware also provides some powerful features that, when used, can significantly reduce the pressure on shared resources: a very important feature in this regard is the ability for threads to access the private registers of other threads in the same warp3, which can greatly reduce the need for LDS in many typical use cases (e.g. reductions).
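This register-exchange feature is exposed to CUDA C through the warp shuffle intrinsics; a minimal sketch of the intra-warp part of a sum reduction that needs no LDS at all (using the pre-CUDA 9 form of the intrinsic):

```cuda
// Each step adds in the value held in the register of a lane 'offset'
// positions higher in the same warp; after 5 steps, lane 0 holds the sum
// of the whole warp, without touching shared memory.
__device__ float warp_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    return val;
}
```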

{}

AMD / ATI (Stream)

AMD entered the GPGPU market with its acquisition of ATI, which happened more or less at the same time as the acquired company started looking seriously into GPGPU. In fact, ATI could be said to have entered the GPGPU market before NVIDIA, with dedicated hardware (ATI FireStream) based on its then-current GPU design. However, it wasn't until mid-2007, with the introduction of the R600 series of GPUs, that ATI/AMD started producing consumer-grade GPGPU solutions.

The Stream architecture underlying ATI/AMD GPGPU-capable cards is significantly more complex than the competition's, while sharing some abstract design principles with it. Each card is equipped with one or more multiprocessors (MPs, which AMD calls SIMD engines), each composed of (typically4) 16 thread processors (TPs), each of which clusters 4 or 5 compute cores5. The compute cores in each of these clusters are designed to work together, and the instruction set provides support both for independent scalar operations and for actual vector operations (e.g. dot products). While the design was initially based on the VLIW principle, the latest generation moved to a simpler SIMD instruction set, for reasons discussed below.

The fundamental logical execution unit of Stream MPs is called a wavefront, with a typical4 size of 64 threads. Wavefronts in Stream still clearly show their graphics origin, especially in the earlier designs: a wavefront is decomposed into quads (corresponding e.g. to a square of 4 adjacent pixels for pixel shaders), with a SIMD engine processing (typically4) 4 quads per cycle (16 threads), alternating between two wavefronts, so that 4 cycles complete a wavefront.

The much more complex hardware of ATI cards has consistently been a double-edged sword.

On the one hand, with their much higher number of cores (ALUs), these cards have been able to achieve higher (theoretical) peak performance, even at lower frequencies than the competition; the ALU clustering and vector computing capability also allowed ATI to implement hardware support for double-precision floating point much earlier than the competition, even on consumer-grade GPUs.

On the other hand, the VLIW instruction set used in the first generations has made the full computational power of an ATI card harder to exploit: it often requires manual vectorization of the code, or careful coding to ensure instruction independence, in order to achieve the instruction density needed to keep all the cores busy during a wavefront.
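What ‘manual vectorization’ amounts to can be sketched as follows (written in CUDA-style C for uniformity with the earlier examples; on ATI hardware the same idea would be expressed in the language actually targeting it, e.g. OpenCL C, whose float4 type even has built-in operators): each work-item processes four elements at once, handing the compiler four independent operations with which to fill the slots of a VLIW bundle.

```cuda
// Scalar version: one operation per work-item, leaving most VLIW slots empty.
__global__ void scale_scalar(const float *in, float *out, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];
}

// Manually vectorized version: one 16-byte load and four independent
// multiplies per work-item.
__global__ void scale_vec4(const float4 *in, float4 *out, float k, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        v.x *= k; v.y *= k; v.z *= k; v.w *= k;
        out[i] = v;
    }
}
```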

These issues have been progressively addressed, first with the move from a 5-channel architecture to a 4-channel one (still using a VLIW instruction set), and finally with the move to a simpler SIMD instruction set5, allowing simpler assembly and decoding and a finer granularity at the dispatch level.

Memory-wise, the multiprocessors are equipped with (vector) registers, which can be treated either as vectors of four 32-bit components or, on hardware that supports it, as vectors of two 64-bit components; on the latest hardware generation there are also scalar registers, reserved for use by the scalar ALU that complements the vector ALUs. In addition to an LDS buffer on each multiprocessor, recent generations of Stream GPUs also feature a single global data share (GDS) buffer, accessible by all multiprocessors, whose purpose is to exchange data and synchronize wavefronts across different multiprocessors.

{}

Software

The software side of GPGPU is actually composed of two distinct parts: host-side APIs, and device-side programming language(s).

The API (Application Programming Interface) is a collection of functions that the host uses to manage its connection to the device: it includes calls to allocate memory on the device, calls to load kernels, calls to launch kernels, and so on. Device-side programming, on the other hand, consists of writing the actual kernels that will be executed on the device.
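To make the distinction concrete, here is a minimal sketch using the CUDA runtime API (one of the options discussed below); the kernel and its names are made up for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device side: the kernel, i.e. the code that actually runs on the GPU.
__global__ void add_one(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Host side: the API calls that manage the device
// (memory allocation, data transfer, kernel launch).
int main() {
    const int n = 1024;
    float host[n] = {0};

    float *dev = NULL;
    cudaMalloc(&dev, n * sizeof(float));                  // allocate on the device
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    add_one<<<(n + 255) / 256, 256>>>(dev, n);            // launch the kernel

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("first element: %f\n", host[0]);
    return 0;
}
```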

CUDA

{}

CAL

{}

OpenCL

{}

Attitude problems

{ NVIDIA: Tesla vs GeForce. Pricing and features. Differentiation. Crippling of GeForce. }

{ AMD: not enough push on the computing side, software issues (e.g. no computing without X11), etc. }


  1. the 2.1 architecture is actually more complex, since it can issue up to two instructions per warp at any given time, distributed among the 3 sets of 16 cores plus the extra 8 special-function units (SFU) and the 16 load/store units (LSU); as long as consecutive instructions for the same warp can be executed independently, they will be issued (and run) at the same time, resulting in an effective 4 warp instructions per cycle.

  2. the third-generation architecture is considerably more complex than previous ones: each MP has 192 cores, plus 32 special-function units (SFU) plus 32 load/store units (LSU); it can run 4 warps at a time, and it features the dual-issue capability of the 2.1 architecture1, so it can run up to two instructions per warp at the same time, resulting in the equivalent of 8 warp instructions per cycle.

  3. or rather, in the same block, which is the next higher logical level of execution, collecting warps that are issued together and that share LDS and synchronization points.

  4. except in some of the low-end cards.

  5. the first three hardware generations, from R600 to Evergreen (Radeon HD 2xxx to Radeon HD 5xxx), featured 4 ‘thin cores’ taking care of simple floating-point and integer operations, and a ‘fat core’ taking care of transcendental and other complex functions; the fourth generation (Northern Islands, Radeon HD 6xxx) only featured 4 equally capable cores; the fifth generation (Southern Islands, Radeon HD 7xxx) introduced a scalar ALU next to the vector ALU cluster: this is the first generation to use a SIMD instruction set instead of the VLIW5 and VLIW4 sets used in previous generations.