A little bit of history

My first contact with GPGPU happened during my first post-doc, although for reasons totally unrelated to work: an open-source videogame (an implementation of the Settlers of Catan boardgame) to which I was making small contributions happened to have a header file which was (legitimately) copied over from the GPGPU programming project.

This was 2006, so GPGPU at the time was extremely preliminary and not directly supported by the hardware manufacturers, but it did open my eyes to the possibility of using graphics cards, whose main development was geared towards hard-core gaming, for other computational purposes, and particularly scientific ones.

So where does GPGPU come from?

The term GPU (Graphics Processing Unit) emerged in the mid-'90s to describe graphics cards and other video hardware with enough computational power to take care of the heavy-duty task of rendering complex, animated three-dimensional scenes in real time.

Initially, although GPUs were computationally more gifted than their predecessors, whose most complex task was blitting (combining rectangular pixel blocks with binary operators such as AND, OR or XOR), their computational power was limited to a set of operations which is nowadays known as the “fixed-function pipeline”.

The bare-bones essentials you need to render a three-dimensional scene are: a way to describe the geometry of the objects, a way to describe the position of the light(s), and a way to describe the position of the observer. Light and observer positions are little more than points in three-dimensional space (for the observer you also need to know which way is ‘up’ and what the field of view is, but those are details we don't particularly care about now), and geometries can be described by simple two-dimensional figures immersed in three-dimensional space: triangles, squares. Of course, since plain colors will not get you far, you also want to paint the inside of these triangles and squares with some given pictures (e.g. something that resembles cobblestone), a process that is called ‘texturing’.

Once you have the geometry (vertices), lights and observer, rendering the scene is just a matter of doing some mathematical operations on them, such as interpolation between vertices to draw lines, or projections (i.e. matrix/vector products) from the three-dimensional space onto the two-dimensional visual plane of the observer. Of course, this has to be done for every single triangle in the scene (and you can have hundreds, thousands, hundreds of thousands or even millions of triangles in a scene), every time the scene is rendered (which should be at least as often as the screen refreshes, so at least some 50, nowadays 60, times per second).
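To fix ideas on what ‘projection’ means here, consider a textbook example (not taken from any specific GPU pipeline): with the observer at the origin looking down the z axis and an image plane at distance d, a vertex at (x, y, z) in homogeneous coordinates is mapped by a matrix/vector product followed by a final division:

\[
\begin{pmatrix} d & 0 & 0 & 0 \\ 0 & d & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
=
\begin{pmatrix} d\,x \\ d\,y \\ z \end{pmatrix}
\quad\longrightarrow\quad
\left( \frac{d\,x}{z}, \frac{d\,y}{z} \right),
\]

which is exactly the kind of small, fixed arithmetic the hardware was built to repeat for every vertex.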

Fixed-function pipelines in GPUs are therefore optimized for very simple mathematical operations, repeated millions (nowadays even billions) of times per second. But however powerful that gets, there are limits to where simple triangles and a naive lighting model can take you: and this is why, by the end of the 20th century, hardware support for shaders started popping up on GPUs.

Shaders are programs that can compute sophisticated lighting effects (of which shadows are only a small part). Since the effects that can be achieved with shaders are so varied, they cannot be implemented within the classic fixed-function pipeline: dedicated computational hardware that could execute these programs (called kernels) had to be introduced.

And suddenly, video cards were not fixed-function devices anymore, but had become programmable, even though still with limitations and peculiar behavior: shader kernels are programs that get executed on each vertex of the geometry, or on each pixel of the scene, and only a limited number of computational features were initially available, since the hardware was still designed for the kind of manipulation that is of interest for 3D rendering.

However, with all their limitations, GPUs now had a very interesting feature: you could tell them to do a specific set of operations on each element of a set (vertex, pixel). The essence of parallel programming, with hardware designed specifically for it. So why not abuse this capability to do things which have nothing to do with three-dimensional scene rendering?

This is where GPGPU started, with some impressive (for the time) results. Of course, it was anything but trivial: you had to fake a scene to be rendered, pass it to the card, ask it to render it while manipulating the scene data with some shader kernels, and then get the resulting rendered scene back and interpret it as the result of the computation. Possible, but clumsy, so a number of libraries and other development tools (such as Brook) started appearing to make the task easier.

As usage of GPUs for non-graphical tasks spread, hardware manufacturers started to realize that there was an opportunity in making things easier for developers, and the true power of GPGPU was made available.

The first ‘real’ GPGPU solutions started appearing between 2006 and 2007, when the two (remaining) major GPU manufacturers (ATi, shortly afterwards acquired by AMD, and NVIDIA) realized that with minimal effort it was possible to expose the shader cores of the GPU and make them available beyond the simple scope of scene rendering.

Although buffers, texture engines and shader cores were now accessible outside of the rendering pipeline, their functional behavior was not altered significantly, and this still has a significant impact on their optimal usage patterns, as well as on some behavioral peculiarities that inevitably surface when GPUs are used as computing devices.

The GPU is (not) my co-processor

Before the Pentium, Intel (and compatible) CPUs had very limited (floating-point) math capabilities, since these were deemed unnecessary for the mainstream market. If you were a scientist or other white-collar worker who needed fast floating-point computations, you could however shell out money for an FPU (Floating-Point Unit), an auxiliary processor specialized in floating-point operations; these units were marked with a numerical code which was the same as the CPU's, except for the final digit, a 7 instead of a 6: so you would have an 8087 next to an 8086, or a 387 next to a 386; and by ‘next’ I mean physically next to it, because the socket where the FPU had to be inserted was typically adjacent to the socket of the CPU.

The FPU as a co-processor started disappearing with the 486, which came in two variants, the higher-end of which (the 486DX) had an FPU integrated into the CPU itself. With the introduction of the Pentium, the FPU became a permanent component of the CPU, and it started evolving (it had remained essentially unchanged since its inception) to support the famous extended ‘multimedia’ instruction sets (MMX, 3DNow!, the various SSE generations, up until the latest AVX extension) of subsequent CPUs. (And by the way, the fact that the FPU and MMX functionalities were implemented in the same piece of hardware had a horrible impact on performance when you used both at the same time. But that's a different topic.)

One of the tenets of GPGPU (marketing) is that the GPU can be used as a co-processor of the CPU. However, there are some very important differences between a co-processor like the FPU, and the GPU.

First of all, the FPU was physically attached to the same bus as the CPU, and FPU instructions were part of the CPU instruction set: the CPU had to detect the FPU instructions and either pass control to the FPU or decode the instructions itself and then pass the decoded instruction to the FPU. Secondly, even though the FPU has a stack of registers, it doesn't have its own RAM.

By contrast, the GPU is more like a co-computer: it has its own RAM, and its own instruction set of which the CPU is completely unaware. The GPU is not controlled directly by the CPU, as happens with a co-processor; rather, the software driver has the CPU send the GPU specific bit patterns, which the GPU interprets as more or less abstract commands such as “load this program (kernel)”, “copy this data”, “execute this other program (kernel)”.

Since all communication has to go through the PCI bus, naively using the GPU as a co-processor is extremely inefficient: most of the time would be spent just exchanging commands and data; this, in fact, was one of the reasons why the old GPGPU approach built on the graphics stack consistently underperformed with respect to what the GPU could theoretically deliver.

The most efficient use of the GPU is therefore as an external machine, communication with which should be limited to the bare minimum: upload as much data as possible at once, load all the programs, issue the programs in sequence, and don't get any (intermediate) data back until it's actually needed on the CPU. It's not just a matter of offloading heavy computations to the GPU: it's about using a separate, complex device for what it was designed for.
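As a concrete, if minimal and purely illustrative, sketch of this usage pattern, here is what it might look like in CUDA C; the kernels and variable names are made up for the example, and error checking is omitted:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // Two toy kernels standing in for "the programs": what they compute is
    // irrelevant; what matters is that the second one reads the first one's
    // output directly from GPU memory, with no round trip to the host.
    __global__ void step1(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    __global__ void step2(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *host = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) host[i] = (float)i;

        float *dev;
        cudaMalloc((void **)&dev, bytes);

        // Upload the whole data set once...
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

        // ...issue the kernels in sequence, with no intermediate read-back...
        const int block = 256, grid = (n + block - 1) / block;
        step1<<<grid, block>>>(dev, n);
        step2<<<grid, block>>>(dev, n);

        // ...and only download the final result when the CPU actually needs it.
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

        printf("host[42] = %f\n", host[42]);

        cudaFree(dev);
        free(host);
        return 0;
    }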

How much faster is a GPU?

When the GPGPU craze found its way into marketing (especially with NVIDIA's push for their new CUDA technology), GPUs were touted as cheap high-performance co-processors that would allow programs to reach a speed-up of two orders of magnitude (over a hundred times faster!), and a large collection of examples showcasing these incredible benefits started to appear. Such order-of-magnitude speed-ups even became almost the only topic of the first published ‘research’ papers on the subject.

Although such incredible speed-ups are quite possible when using GPUs, the reality is rather more complex, and a non-negligible part of these speed-ups is actually achievable even on standard CPUs. To understand in more detail what practical speed-ups can be expected, we have to look at the fundamental areas where GPUs perform (potentially) much better than CPUs (computational power and memory bandwidth), and the conditions under which this better performance can actually be achieved.

Faster memory (?)

Let us look at memory first. It's undeniably true that GPU memory is designed to have a much higher bandwidth than the RAM normally mounted on the computer motherboard (hereafter referred to as ‘CPU memory’): in 2007, when GPGPU started being officially supported by hardware manufacturers, GPU memory had peak theoretical bandwidths ranging from 6.4 GB/s (on low-end GPUs using DDR2 chips) to over 100 GB/s (on high-end cards using GDDR3 chips). By comparison, CPUs usually had DDR2 chips, whose performance ranged from 3.2 GB/s to 8.5 GB/s. Now (2012) GPUs can reach bandwidths of almost 200 GB/s with GDDR5 memory, whereas the best a CPU can hope for is less than 20 GB/s on DDR3.

Since the bandwidth is almost consistently an order of magnitude higher on GPUs than on CPUs, one should expect an order of magnitude in speed-up for a problem that is memory-bound (one that does a lot of memory access and very little computation), assuming it can get close to the theoretical peak bandwidth and assuming the data is already on the device.
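A back-of-the-envelope estimate, using round numbers in line with the figures above rather than measurements, shows where that order of magnitude comes from: for a purely memory-bound task the runtime is roughly the amount of data moved divided by the bandwidth, so streaming one gigabyte once costs about

\[
\frac{1\ \text{GB}}{10\ \text{GB/s}} = 100\ \text{ms on the CPU}
\qquad\text{versus}\qquad
\frac{1\ \text{GB}}{150\ \text{GB/s}} \approx 7\ \text{ms on the GPU.}
\]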

We'll talk about the problem of getting data onto the device later on, but we can already mention a few things about reaching the peak bandwidth, without getting into too much detail.

The catch with the higher bandwidth on GPUs is latency. While CPU to (uncached) RAM access latency is usually less than 100 ns, on GPUs it is 3 to 5 times higher; and the first GPUs had no cache to speak of (except for textures, but that's a different matter, since textures also have lower bandwidth). Of course, GPUs have specific methods to cover this high latency: after all, a GPU is optimized for moving large slabs of data around, as long as such data is organized ‘appropriately’, and memory accesses are designed accordingly.

Therefore, memory-bound GPU algorithms have to be designed in such a way that they make as much use as possible of these latency-covering techniques (coalescing on NVIDIA GPUs, fastpath usage on AMD GPUs), lest they see their performance drop from being 10 times faster than on the CPU to being no more than 2 or 3 times faster. These remarks are particularly important for the implementation of scatter/gather or sorting algorithms.
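To give an idea of what ‘appropriately organized’ means in practice, here is an illustrative CUDA sketch (not taken from any real code base): the two kernels move exactly the same amount of data, but the first lets consecutive threads touch consecutive addresses, which the hardware can coalesce into a few wide transactions, while the second scatters the accesses and typically runs several times slower.

    // Coalesced: thread i reads element i, so each warp's accesses fall into
    // a few contiguous memory segments and the full bandwidth can be reached.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads read addresses 'stride' elements apart,
    // so a warp touches many separate memory segments and the effective
    // bandwidth drops sharply.
    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[((long long)i * stride) % n];
    }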

Faster computing (?)

Of course, where GPUs really shine is not in juggling data around, but in doing actual computations on them: gamer GPUs passed the (theoretical) teraFLOPS barrier in 2008 (Radeon HD 4850), when the best (desktop) CPU of the time fell short of achieving some theoretical 60 gigaFLOPS, and most common ones couldn't dream of getting half as much.

But from 20 gigaFLOPS to 1 teraFLOPS there's only a factor of 50: so where do the claimed two orders of magnitude in speed-up come from? Unsurprisingly, the difference comes from a consistent underutilization of the CPU. We'll leave that aside for the moment, though, and focus instead on the impressive (theoretical) performance sported by GPUs.

The first thing that should be mentioned about GPUs is that they are not designed to be fast in the sense that CPUs are fast. For years, CPU performance was strongly tied to the frequency at which the processor operates, with a theoretical upper limit of one instruction per cycle, which would mean that a CPU running at 1 GHz couldn't do more than one (short) billion operations per second. These days, the fastest desktop CPUs run at over 3 GHz, while the fastest GPUs have computing clocks of about 1 GHz, or even less.

However, GPUs are designed for massively parallel tasks, such as running a specific sequence of instructions on each element of a set of vertices or pixels, with each element being processed independently (or almost independently) from the others. The shaders in GPUs are made up of a large number of processing elements grouped into multiprocessors, with each multiprocessor capable of executing the same single instruction (sequence) on a large number of elements at once.

In some sense, GPUs can be seen as a collection of SIMD (Single Instruction, Multiple Data) processors (typically 10 or more), each with a very wide vector width (typically 32 for NVIDIA, 64 for AMD); while modern CPUs are also SIMD-capable, with their MMX and SSE instructions, and can also sport multiple cores, they have fewer SIMD lanes (typically 4 or 8) and fewer cores (2 to 6) than GPUs.

The GPU programming tools and languages expose this massively parallel computing capability, and make it very easy to exploit. The simplest GPU programs consist of a kernel, i.e. a sequence of instructions to be executed (typically) on a single element of a set, which is handed to the GPU to be run on all the elements of a given set.
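For instance, here is a minimal, generic sketch in CUDA C (variable names are invented for the example) of the classic ‘scale one array and add it to another’ operation:

    // Serial CPU version: an explicit loop over the whole set.
    //     for (int i = 0; i < n; ++i)
    //         y[i] = a * x[i] + y[i];

    // GPU kernel: describes what happens to ONE element; the launch
    // configuration spreads it over all n elements.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Launched (on data already resident on the GPU) as, for example:
    //     saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);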

By contrast, exploiting the vector capabilities of modern CPUs and their multiple cores requires complex programming techniques, special instructions which are barely more abstract than their hardware counterparts, and complex interactions with the operating system to handle the multiple threads to be distributed across the cores.

In other words, it's much easier to exploit the massively parallel nature of the GPUs than it is to exploit the available parallel computing capabilities of the CPUs. And this is where the two orders of magnitude in performance difference come from: CPUs are rarely used as more than single-core scalar processors.

Still, even when comparing well-designed CPU programs with well-designed GPU programs it's not surprising to see minutes of runtime be reduced to seconds. If you see less than that, you're probably doing something wrong, and if you're seeing much more than that, your CPU program is probably far from being optimal.

The question then becomes: how hard is it to write a well-designed GPU program as opposed to a well-designed CPU program? But this is a question I'll leave for later. For the time being, let's just leave it at: non-trivial.

Up and down (loads)

As previously mentioned, GPUs normally have their own memory, separate from the system memory (this is not true for integrated GPUs, but they deserve a separate paragraph). Therefore, using the GPU involves transferring data to it, and then retrieving the results when the kernel(s) have finished their operation.

The time spent uploading data to the GPU and downloading data from it is not necessarily insignificant: through a PCI Express 2.0 ×16 link you can hope for an 8 GB/s transfer rate, but 5 or 6 GB/s are more likely; and this is pretty close to being the top of the line. When compared to the GPU or even the CPU memory bandwidth, this can be a very significant bottleneck.
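Again with round numbers rather than measurements, the imbalance is easy to quantify: a kernel that only traverses its data a few times will spend most of its wall-clock time on the bus, since

\[
t_{\text{PCIe}} = \frac{1\ \text{GB}}{6\ \text{GB/s}} \approx 170\ \text{ms}
\qquad\text{versus}\qquad
t_{\text{GPU mem}} = \frac{1\ \text{GB}}{150\ \text{GB/s}} \approx 7\ \text{ms.}
\]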

This, combined with the relatively small amount of memory available on GPUs (less than a gigabyte when GPGPU started, slightly over a gigabyte four years later), poses an interesting paradox about the convenience of GPUs.

On the one hand, GPUs are most convenient when processing large amounts of data in parallel: this ensures, together with well-designed algorithms, that the GPU hardware is used at full capacity for an adequate amount of time.

On the other hand, there's a limit to the amount of data you can load at once on the GPU: desktops today are commonly equipped with 4 gigabytes of RAM or more (dedicated workstations or servers can easily go into the tens of gigabytes), so the host can typically hold much larger amounts of data than the GPU. The only way to process such data sets on standard desktop GPUs is to do it in chunks: upload the first chunk, process it, download the result, upload the next chunk, and so on.

Luckily enough, GPUs are typically equipped with asynchronous copy engines, so the situation is not as dreary as it would otherwise be. In many cases, especially with modern devices, it is possible to overlap computations and host/device data transfers so as to hide the overhead of the data exchange. This, in fact, is one of the many ways in which GPGPU programming can become complex when optimal performance is sought.
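Here is a hedged sketch of how such overlap is typically expressed in CUDA (chunk sizes, kernel and variable names are all invented for the example; error checking omitted): two streams and two device buffers are used so that, while one chunk is in flight over the bus, the previous one is being processed.

    #include <cuda_runtime.h>

    __global__ void process(float *chunk, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] *= 2.0f;   // placeholder computation
    }

    int main()
    {
        const int    chunk_elems = 1 << 20;
        const int    num_chunks  = 8;
        const size_t chunk_bytes = chunk_elems * sizeof(float);

        // Pinned (page-locked) host memory is needed for truly asynchronous copies.
        float *h_data;
        cudaMallocHost((void **)&h_data, chunk_bytes * num_chunks);

        cudaStream_t stream[2];
        float       *d_buf[2];
        for (int s = 0; s < 2; ++s) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc((void **)&d_buf[s], chunk_bytes);
        }

        for (int c = 0; c < num_chunks; ++c) {
            const int s = c % 2;                       // alternate streams/buffers
            float *h_chunk = h_data + (size_t)c * chunk_elems;

            // Upload, process and download each chunk in its own stream: work
            // queued in different streams can overlap on the copy engines.
            cudaMemcpyAsync(d_buf[s], h_chunk, chunk_bytes,
                            cudaMemcpyHostToDevice, stream[s]);
            process<<<(chunk_elems + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk_elems);
            cudaMemcpyAsync(h_chunk, d_buf[s], chunk_bytes,
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        cudaDeviceSynchronize();                       // wait for all chunks to finish

        for (int s = 0; s < 2; ++s) {
            cudaStreamDestroy(stream[s]);
            cudaFree(d_buf[s]);
        }
        cudaFreeHost(h_data);
        return 0;
    }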

Is it still worth it?

Even if two orders of magnitude may not be achievable without extensive programming efforts to produce extremely optimized code, the single order of magnitude one can get for most trivially parallelizable problems is most often worth the time necessary to reimplement computationally heavy code to use GPGPU.

One of the most interesting features of the shared-memory parallel programming approach needed for GPGPU is that, when it can be employed, it's a much more future-proof way of coding than serial execution. The reason is clear: while serial execution can only improve by using faster processors (and the physical upper limits on how fast a scalar processor can go are getting closer by the year), parallel algorithms can get faster by ‘just’ adding more computational units. In theory, a perfectly parallel algorithm will take half the time to run on twice the cores, and while the reality is less ideal, the performance gain is still quite perceptible.
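The usual way to quantify ‘less ideal’ is Amdahl's law (a standard result, not something specific to GPUs): if a fraction p of a program's runtime is parallelizable, the speed-up on N processing units is bounded by

\[
S(N) = \frac{1}{(1 - p) + p/N},
\]

so that, for example, with p = 0.95 even N = 16 units yield only about a 9× speed-up; the more thoroughly parallel the algorithm, the better it keeps scaling as the hardware grows wider.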

The hidden benefits of GPGPU

In many ways, the most important contribution of GPGPU to the domain of computer science and software engineering has not been the actual performance benefit that many fields have seen from the availability of cheap parallel computing platforms.

There's indeed a much longer-term benefit, to be reaped over the coming years, and it's precisely the shift from serial to parallel programming we just mentioned. Before GPGPU, parallel programming was left to the domain of expensive computational clusters and sophisticated programming techniques; GPGPU has shown that there are huge opportunities for shared-memory parallel programming even on the lower end of the market.

The reborn interest in parallel programming triggered by GPGPU is gradually leading to the development of an entirely new mentality, both in terms of software development and of hardware design. Although it remains to be seen how it will ultimately pan out, there are significant signs that we're only starting to scratch the surface of technologies that could revolutionize computing to an extent comparable only with the effects of the commercialization of the Internet twenty years ago, and the introduction of the first microcomputers twenty years before that.

GPGPU is also bleeding out of the GPU market, in an interesting combination of paradoxical feedback loops and returns. Some people have implemented ray tracing using GPGPU: the GPU goes back to its intended task, but through features that were designed to make it usable outside of that domain. At the same time, CPUs are gaining more powerful parallel computing features, and integrated CPU/GPU solutions bring the GPU more in line with the standard co-processor role that marketing wanted to sell GPGPU for.

We are starting to see a convergence in technology. At this point, the only danger to the rich potential of this dawning era is the petty commercial interest of companies that would rather see the market collapse under fragmentation than prosper without their dominance.

Let us hope that this won't happen.