One would assume that doing integer math with computers would be easy. After all, integer math is, in some sense, the “simplest” form of math: as Kronecker said:

Die ganzen Zahlen hat der liebe Gott gemacht, alles andere ist Menschenwerk

The dear God made the integers; everything else is the work of man

While in practice this is (almost) always the case, introducing the extent (and particularly the limitations) to which integer math is easy (or, in fact, ‘doable’) is the first necessary step to understanding, later in this series of articles, some of the limitations of fixed-point math.

We start from the basics: since we are assuming a binary computer, we know that n bits can represent 2^n distinct values. So an 8-bit byte can represent 256 distinct values, a 16-bit word can represent 65,536 distinct values, a 32-bit word can represent 4,294,967,296, and a 64-bit word a whopping 18,446,744,073,709,551,616, over 18 (short scale) quintillion. Of course, the question now is: which ones?

Representation of unsigned integers

Let's consider a standard 8-bit byte. The most obvious and natural interpretation of a byte (i.e. 8 consecutive bits) is to interpret it as a (non-negative, or unsigned) integer, just like we would interpret a sequence of consecutive (decimal) digits. So binary 00000000 would be 0, binary 00000001 would be (decimal) 1, binary 00000010 would be (decimal) 2, binary 00000011 would be (decimal) 3, and so on, up to (binary) 11111111 which would be (decimal) 255. From 0 to 255 inclusive, that's exactly the 256 values that can be represented by a byte (read as an unsigned integer).

Unsigned integers can be trivially promoted to wider words (e.g. from 8-bit byte to 16-bit word, preserving the numerical value) by padding with zeroes.
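
For instance, in C the promotion is just an assignment or cast to the wider unsigned type; a minimal illustration (the function name is mine):

```c
#include <stdint.h>

/* Zero-extension: the numerical value is preserved, because the
 * added high bits are all 0. */
uint16_t promote_u8(uint8_t b)
{
    return (uint16_t)b;   /* e.g. 0xA5 (165) becomes 0x00A5 (still 165) */
}
```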

This is so simple that it's practically boring. Why are we even going through this? Because things are not that simple once you move beyond unsigned integers. But before we do that, I would like to point out that things aren't that simple even if we're just sticking to non-negative integers. In terms of representation of the numbers, we're pretty cozy: n bits can represent all non-negative integers from 0 to 2^n-1, but what happens when you start doing actual math on them?

Modulo and saturation

Let's stick to just addition and multiplication at first, which are the simplest and best defined operations on integers. Of course, the trouble is that if you are adding or multiplying two numbers between 0 and 255, the result might be bigger than 255. For example, you might need to do 100 + 200, or 128*2, or even just 255+1, and the result is not representable in an 8-bit byte. In general, if you are operating on n-bits numbers, the result might not be representable in n bits.

So what does the computer do when this kind of overflow happens? Most programmers will now chime in and say: well duh, it wraps! If you're doing 255+1, you will just get 0 as a result. If you're doing 128*2, you'll just get 0. If you're doing 100+200 you'll just get 44.

While this answer is not wrong, it's not right either.

Yes, it's true that the most common central processing units we're used to nowadays use modular arithmetic, so that operations that would overflow n-bits words are simply computed modulo 2^n (which is easy to implement, since it just means discarding higher bits, optionally using some specific flag to denote that a carry got lost along the way).

However, this is not the only possibility. For example, specialized DSP (Digital Signal Processing) hardware normally operates with saturation arithmetic: overflowing values are clamped to the maximum representable value. 255+1 gives 255. 128*2 gives 255. 100+200 gives 255.
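
Saturating operations are also easy to emulate in software on ordinary hardware. Here is a minimal sketch in C for 8-bit unsigned values (the helper names are mine, not a standard API):

```c
#include <stdint.h>

/* Saturating 8-bit unsigned addition: clamp to 255 on overflow. */
uint8_t sat_add8(uint8_t a, uint8_t b)
{
    uint16_t r = (uint16_t)a + b;        /* compute in a wider type */
    return r > UINT8_MAX ? UINT8_MAX : (uint8_t)r;
}

/* Saturating 8-bit unsigned subtraction: clamp to 0 on underflow. */
uint8_t sat_sub8(uint8_t a, uint8_t b)
{
    return a < b ? 0 : (uint8_t)(a - b);
}

/* Saturating 8-bit unsigned multiplication: clamp to 255 on overflow. */
uint8_t sat_mul8(uint8_t a, uint8_t b)
{
    uint16_t r = (uint16_t)a * b;
    return r > UINT8_MAX ? UINT8_MAX : (uint8_t)r;
}
```

With these, sat_add8(255, 1), sat_mul8(128, 2) and sat_add8(100, 200) all return 255, as described above.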

Programmers used to the standard modular arithmetic can find saturation arithmetic ‘odd’ or ‘irrational’ or ‘misbehaving’. In particular, in saturation arithmetic (algebraic) addition is not associative, and multiplication does not distribute over (algebraic) addition.

Sticking to our 8-bit case, for example, with saturation arithmetic (100 + 200) - 100 results in 255 - 100 = 155, while 100 + (200 - 100) results in 100 + 100 = 200, which is the correct result. Similarly, still with saturation arithmetic, (200*2) - (100*2) results in 255 - 200 = 55, while (200 - 100)*2 results in 100*2 = 200. By contrast, with modular arithmetic, both expressions in each case give the correct result.
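
Using the sat_* helpers sketched earlier, the failure of associativity and distributivity can be checked directly:

```c
void saturation_pitfalls(void)
{
    /* (100 + 200) - 100: the intermediate sum saturates to 255. */
    uint8_t x = sat_sub8(sat_add8(100, 200), 100);  /* 255 - 100 = 155 */

    /* 100 + (200 - 100): no intermediate overflow, correct result. */
    uint8_t y = sat_add8(100, sat_sub8(200, 100));  /* 100 + 100 = 200 */

    /* (200*2) - (100*2) versus (200 - 100)*2. */
    uint8_t u = sat_sub8(sat_mul8(200, 2), sat_mul8(100, 2)); /* 255 - 200 = 55 */
    uint8_t v = sat_mul8(sat_sub8(200, 100), 2);              /* 100*2 = 200    */

    (void)x; (void)y; (void)u; (void)v;  /* silence unused-variable warnings */
}
```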

So, when the final result is representable, modular arithmetic gives the correct result in the case of a static sequence of operations. However, when the final result is not representable, saturation arithmetic returns values that are closer to the correct one than modular arithmetic: 300 is clamped to 255, in contrast to the severely underestimated 44.

Being as close as possible to the correct results is an extremely important property not just for the final result, but also for intermediate results, particularly in the cases where the sequence of operations is not static, but depends on the magnitude of the values (for example, software implementations of low- or high-pass filters).

In these applications (of which DSP, be it audio, video or image processing, is probably the most important one) both modular and saturation arithmetic might give the wrong result, but the modular result will usually be significantly worse than that obtained by saturation. For example, modular arithmetic might miscompute a frequency of 300Hz as 44Hz instead of 255Hz, and with a threshold of 100Hz this would lead to attenuation of a signal that should have passed unchanged, or vice versa. Amplifying an audio signal beyond the representable values could result in silence with modular arithmetic, but it will just produce the loudest possible sound with saturation.

We mentioned that promotion of unsigned values to wider data types is trivial. What about demotion? For example, knowing that your original values are stored as 8-bit bytes and that the final result has to be again stored as an 8-bit byte, a programmer might consider operating with 16-bit (or wider) words to (try and) prevent overflow during computations. However, when the final result has to be demoted to an 8-bit byte, a choice has to be made: should we just discard the higher bits (which is what modular arithmetic does), or return the highest representable value when any higher bits are set (which is what saturation arithmetic does)? Again, this is a choice for which there is no “correct” answer, but only answers that depend on the application.
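
As a sketch, here is what the two demotion policies look like in C (again, the function names are mine):

```c
#include <stdint.h>

/* Modular demotion: discard the high bits. */
uint8_t demote_wrap(uint16_t w)
{
    return (uint8_t)w;        /* e.g. 300 (0x012C) becomes 44 (0x2C) */
}

/* Saturating demotion: clamp to the highest representable value. */
uint8_t demote_sat(uint16_t w)
{
    return w > UINT8_MAX ? UINT8_MAX : (uint8_t)w;  /* 300 becomes 255 */
}
```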

To conclude, the behavior that programmers used to standard modular arithmetic might find ‘wrong’ is actually preferable in some applications (which is why it has been supported in hardware in the multimedia and vector extensions (MMX and onwards) of the x86 architecture).

Thou shalt not overflow

Of course, the real problem in the examples presented in the previous section is that the data type used (e.g. 8-bit unsigned integers) was unable to represent intermediate or final results.

One of the most important things programmers should consider when doing math on a computer, maybe the most important, is precisely the choice of the correct data type.

For integers, this means choosing a data type that can represent correctly not only the starting values and the final results, but also the intermediate values. If your data fits in 8 bits, then you want to use at least 16 bits. If it fits in 16 bits (but not 8), then you want to use at least 32, and so on.
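
For example, a sum (or average) of 8-bit samples would be accumulated in a wider type, demoting only at the very end; a sketch, with hypothetical names:

```c
#include <stdint.h>
#include <stddef.h>

/* Average of n 8-bit samples (n > 0), accumulated in 32 bits:
 * the intermediate sum cannot overflow for any n up to 2^24. */
uint8_t average8(const uint8_t *samples, size_t n)
{
    uint32_t sum = 0;               /* wide accumulator */
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return (uint8_t)(sum / n);      /* the average always fits in 8 bits */
}
```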

Having a good understanding of the possible behaviors in case of overflow is extremely important to write robust code, but the main point is that you should not overflow.

Relative numbers: welcome to hell

In case you are still of the opinion that integer math is easy, don't worry. We still haven't gotten into the best part, which is how to deal with relative numbers, or, as the layman would call them, signed integers.

As we mentioned above, the ‘natural’ interpretation of n bits is to read them as natural, non-negative, unsigned integers, ranging from 0 to 2^n-1. However, let's be honest here, non-negative integers are pretty limiting. We would at least like to have the possibility to also specify negative numbers. And here the fun starts.

Although there is no official universal standard for the representation of relative numbers (signed integers) on computers, there is undoubtedly a dominating convention, which is the one programmers are nowadays used to: two's complement. However, this is just one of many (no less than four) possible representations:

  • sign bit and mantissa;
  • ones' complement;
  • two's complement;
  • offset binary aka biased representation.

Symmetry, zeroes and self-negatives

One of the issues with the representation of signed integers in binary computers is that binary words can always represent an even number of values, but a symmetrical number of positive and negative integers, plus the value 0, is odd. Hence, when choosing the representation, one has to choose between either:

  • having one (usually negative) non-zero number with no representable opposite, or
  • having two representations of the value zero (essentially, positive and negative zero).

Of the four signed number representations enumerated above, the sign bit and ones' complement representations have a signed zero, but each non-zero number has a representable opposite, while two's complement and bias only have one value for zero, but have at least one non-zero number that has no representable opposite. (Offset binary is actually very generic and can have significant asymmetries in the ranges of representable numbers.)

Having a negative zero

The biggest issue with having a negative zero is that it violates a commonly held assumption, which is that there is a bijective correspondence between representable numerical values and their representation, since both positive and negative 0 have the same numerical value (0) but have distinct bit patterns.

Where this presents the biggest issue is in the comparison of two words. When comparing words for equality, we are now posed a conundrum: should they be compared by their value, or should they be compared by their representation? If a = -0, would a satisfy a == 0? Would it satisfy a < 0? Would it satisfy both? The obvious answer would be that +0 and -0 should compare equal (and just that), but how do you tell them apart then? Is it even worth it being able to tell them apart?

And finally, is the symmetry worth the loss of a representable value? (2^n bit patterns, but two of them have the same value, so e.g. with 8-bit bytes we have 256 patterns to represent 255 values instead of the usual 256.)

Having non-symmetric opposites

On the other hand, if we want to keep the bijectivity between value and representation, we will lose the symmetry of negation. This means, in particular, that knowing that a number a satisfies a < 0 we cannot deduce that -a > 0, or vice versa, depending on whether the value with no opposite is positive or negative.

Consider for example the case of the standard two's complement representation in the case of 8-bit bytes: the largest representable positive value is 127, while the largest (in magnitude) representable negative value is -128. When computing opposites, all values between -127 and 127 have their opposite (which is the one we would expect algebraically), but negating -128 gives (again) -128 which, while algebraically wrong, is at least consistent with modular arithmetic, where adding -128 and -128 actually gives 0.
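
This corner case is easy to observe; a sketch (the wrap-around on the conversion is what typical two's complement hardware does, though strictly speaking it is implementation-defined in C):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int8_t a = -128;       /* bit pattern 10000000 */
    /* -a is computed as an int (128), which does not fit in int8_t:
     * converting it back wraps to -128 on two's complement machines. */
    int8_t b = (int8_t)-a;
    printf("%d %d\n", a, b);   /* prints: -128 -128 */
    return 0;
}
```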

A brief exposition of the representations

Let's now see the representations in some more detail.

Sign bit and mantissa representation

The conceptually simplest approach to represent signed integers, given a fixed number of digits, is to reserve one bit to indicate the sign, and leave the other n-1 bits to indicate the mantissa, i.e. the magnitude or absolute value, of the number. By convention, the sign bit is usually taken to be the most significant bit, and (again by convention) it is taken as 0 to indicate a positive number and 1 to indicate a negative number.

With this representation, two opposite values have the same representation except for the most significant bit. So, for example, assuming our usual 8-bit byte, 1 would be represented as 00000001, while -1 would be represented as 10000001.

In this representation, the highest positive value that can be represented with n bits is 2^{n-1} - 1, and the lowest (largest in magnitude) negative value that can be represented is its opposite. For example, with an 8-bit byte the largest positive integer is 127, i.e. 01111111, and the largest (in magnitude) negative integer is its opposite -127, i.e. 11111111.

As mentioned, one of the downsides of this representation is that it has both positive and negative zero, respectively represented by the 00000000 and 10000000 bit patterns.

While the sign bit and mantissa representation is conceptually obvious, its hardware implementation is more cumbersome than it might seem at first glance, since operations need to explicitly take the operands' signs into account. Similarly, sign-extension (for example, promoting an 8-bit byte to a 16-bit word preserving the numerical value) needs to ‘clear up’ the sign bit in the smaller-size representation before replicating it as the sign bit of the larger-size representation.
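
A sketch of encoding and decoding sign-and-magnitude bytes in C, storing the bit patterns in a uint8_t (function names are mine):

```c
#include <stdint.h>

/* Encode a value in [-127, 127] as sign bit + 7-bit magnitude. */
uint8_t sm_encode(int v)
{
    uint8_t mag = (uint8_t)(v < 0 ? -v : v);        /* |v|, fits in 7 bits */
    return (uint8_t)((v < 0 ? 0x80 : 0x00) | mag);  /* sign in the top bit */
}

/* Decode a sign-and-magnitude byte back to an int. */
int sm_decode(uint8_t p)
{
    int mag = p & 0x7F;             /* low 7 bits: the magnitude */
    return (p & 0x80) ? -mag : mag;
}
```

Note that both 0x00 and 0x80 decode to 0: the positive and negative zero discussed above.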

Ones' complement representation

A more efficient approach is offered by ones' complement representation, where negation maps to ones' complement, i.e. bit-flipping: the opposite of any given number is obtained as the bitwise NOT operation of the representation of the original value. For example, with 8-bit bytes, the value 1 is as usual represented as 00000001, while -1 is represented as 11111110.

The range of representable numbers is the same as in the sign bit and mantissa representation, so that, for example, 8-bit bytes range from -127 (10000000) to 127 (01111111), and we have both positive zero (00000000) and negative zero (11111111).

(Algebraic) addition in modular arithmetic with this representation is trivial to implement in hardware, with the only caveat that carries and borrows ‘wrap around’.
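
This ‘end-around carry’ can be modeled in C on raw bit patterns: perform the addition in a wider type, then fold any carry out of the top bit back into the bottom (a sketch, not production code):

```c
#include <stdint.h>

/* Ones' complement 8-bit addition, modeled on unsigned bit patterns:
 * a carry out of the 8th bit is 'wrapped around' and added back in. */
uint8_t oc_add8(uint8_t a, uint8_t b)
{
    uint16_t r = (uint16_t)a + b;
    r = (r & 0xFF) + (r >> 8);    /* end-around carry (at most 1) */
    return (uint8_t)r;
}

/* Example: 2 + (-1) = 00000010 + 11111110 = (1)00000000, and the
 * wrapped carry gives 00000001, i.e. 1, as expected. */
```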

As in the sign-bit case, it is possible to tell if a number is positive or negative by looking at the most-significant bit, and 0 indicates a positive number, while 1 indicates a negative number (whose absolute value can then be obtained by flipping all the bits). Sign-extending a value can be done by simply propagating the sign bit of the smaller-size representation to all the additional bits in the larger-size representation.

Two's complement

While ones' complement representation is practical and relatively easy to implement in hardware, it is not the simplest, and it's afflicted by the infamous ‘negative zero’ issue.

Because of this, two's complement representation, which is simpler to implement and has no negative zero, has gained much wider adoption. It also has the benefit of ‘integrating’ rather well with the equally common modular arithmetic.

In two's complement representation, the opposite of an n-bit value is obtained by subtracting it from 2^n, or, equivalently, by flipping the bits and then adding 1, discarding any carries beyond the n-th bit. Using our usual 8-bit bytes as an example, 1 will as usual be 00000001, while -1 will be 11111111.
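
Since C defines unsigned arithmetic to be modular, two's complement negation on raw 8-bit patterns can be sketched directly:

```c
#include <stdint.h>

/* Two's complement negation: flip all the bits and add 1, modulo 2^8
 * (equivalently, subtract the pattern from 2^8 = 256). */
uint8_t tc_neg8(uint8_t x)
{
    return (uint8_t)(~x + 1u);
}

/* tc_neg8(0x01) == 0xFF: -1 is represented as 11111111.
 * tc_neg8(0x80) == 0x80: -128 is its own opposite, as discussed. */
```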

The largest positive representable number with n bits is still 2^{n-1}-1, but the largest (in magnitude) negative representable number is now -2^{n-1}, and it's represented by a high-bit set to 1 and all other bits set to 0. For example, with 8-bit bytes the largest positive number is 127, represented by 01111111, whose opposite -127 is represented by 10000001, while the largest (in magnitude) negative number is -128, represented by 10000000.

In two's complement representation, there is no negative zero and the only representation for 0 is given by all bits set to 0. However, as discussed earlier, this leads to a negative value whose opposite is the value itself, since the representation of the largest (in magnitude) negative representable number is invariant under two's complement.

As in the other two representations, the most significant bit can be checked to see if a number is positive or negative. As in the ones' complement case, sign-extension is done trivially by propagating the sign bit of the smaller-size value to all other bits of the larger-size value.

Offset binary

Offset binary (or biased representation) is quite different from the other representations, but it has some very useful properties that have led to its adoption in a number of schemes (most notably the IEEE-754 standard for floating-point representation, where it's used to encode the exponent, and some DSP systems).

Before getting into the technical details of offset binary, we look at a possible motivation for its inception. The attentive reader will have noticed that all the previously mentioned representations of signed integers have one interesting property in common: they violate the natural ordering of the representations.

Since the most significant bit is taken as the sign bit, and negative numbers have a most significant bit set to one, natural ordering (by bit patterns) puts them after the positive numbers, whose most significant bit is set to 0. Additionally, in the sign bit and mantissa representation, the ordering of negative numbers is reversed with respect to the natural ordering of their representation. This means that when comparing numbers it is important to know if they are signed or unsigned (and if signed, which representation) to get the ordering right. The biased representation is one way (and probably the most straightforward way) to circumvent this.

The basic idea in biased representation or offset binary is to ‘shift’ the numerical value of all representations by a given amount (the bias or offset), so that the smallest natural representation (all bits 0) actually evaluates to the smallest representable number, and the largest natural representation (all bits 1) evaluates to the largest representable number.

The bias is the value that is added to the (representable) value to obtain the representation, and subtracted from the representation to obtain the represented value. The minimum representable number is then the opposite of the bias. Of course, the range of representable numbers doesn't change: if your data type can only represent 256 values, you can only choose which 256 values, as long as they are consecutive integers.

The bias in an offset binary representation can be chosen arbitrarily, but there is a ‘natural’ choice for n-bit words, which is 2^{n-1}: halfway through the natural representation. For example, with 8-bit bytes (256 values) the natural choice for the bias is 128, leading to a representable range of integers from -128 to 127, which looks distinctly similar to the one that can be expressed in two's complement representation.

In fact, the 2^{n-1} bias leads to a representation which is equivalent to the two's complement representation, except for a flipped sign bit, solving the famous signed versus unsigned comparison issue mentioned at the beginning of this subsection.

As an example, consider the usual 8-bit bytes with a bias of 128: then, the numerical values 1, 0 and -1 would be represented by the ‘natural’ representation of the values 129, 128 and 127 respectively, i.e. 10000001, 10000000 and 01111111: flipping the most significant bits, we get 00000001, 00000000 and 11111111 which are the two's complement representation of 1, 0 and -1.
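
In other words, with the ‘natural’ bias, converting between offset binary and two's complement is just an XOR of the sign bit, in either direction; a sketch:

```c
#include <stdint.h>

/* Excess-128 (offset binary, bias 128) <-> two's complement, on 8-bit
 * patterns: both directions are the same operation. */
uint8_t excess128_to_twos(uint8_t p) { return p ^ 0x80; }
uint8_t twos_to_excess128(uint8_t p) { return p ^ 0x80; }
```

A nice side effect is that comparing excess-128 patterns as plain unsigned integers yields the correct signed ordering, which is precisely the property motivating the representation.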

Of course, the ‘natural’ bias is not the only option: it is possible to have arbitrary offsets, which makes offset binary extremely useful in applications where the range of possible values is strongly asymmetrical around zero, or where it is far from zero. Of course, such arbitrary biases are rarely supported in hardware, so operating on offset binary usually requires software implementations of even the most common operations, with a consequent performance hit. Still, assuming the hardware uses modular arithmetic, offset binary is at least trivial to implement for the basic operations.

One situation in which offset binary doesn't play particularly well is that of sign-extension, which was trivial in ones' and two's complement representations. The biggest issue in the case of offset binary is, obviously, that the offsets in the smaller and larger data types are likely going to be different, although usually not arbitrarily different (biases are often related to the size of the data type).

At least in the case of the ‘natural’ bias (in both the smaller and larger data types), sign extension can be implemented straightforwardly by going through the two's complement equivalent representation: flip the most significant bit of the smaller data type, propagate it to all the remaining bits of the larger data type, and then flip the most significant bit of the larger data type. (In other words: convert to two's complement, sign extend that, convert back to offset binary with the ‘natural’ bias.)
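
A sketch of this procedure in C, assuming the ‘natural’ bias in both the 8-bit (excess-128) and 16-bit (excess-32768) types:

```c
#include <stdint.h>

/* Sign-extend an excess-128 byte to an excess-32768 16-bit word. */
uint16_t excess_extend(uint8_t p)
{
    uint8_t  t = p ^ 0x80;                            /* to two's complement */
    uint16_t w = (t & 0x80) ? (uint16_t)(0xFF00u | t) /* propagate sign bit  */
                            : (uint16_t)t;
    return (uint16_t)(w ^ 0x8000u);                   /* back to offset binary */
}

/* Example: -1 is 01111111 in excess-128; excess_extend(0x7F) gives
 * 0x7FFF, i.e. 32767 = -1 + 32768, as expected. */
```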

What does a bit pattern mean?

We're now nearing the end of our discussion of integer math on computers. Before getting into the messy details of the first common non-integer operation (division), I would like to ask the following question: what do you get if you do 10100101 + 01111111?

Divide and despair

To conclude our exposition of the joys of integer math on computers, we now discuss the beauty of integer division and the related modulus operation.

Since division of the integer e by the integer o only gives an integer (mathematically) if e is a multiple of o, the concept of ‘integer division’ has arisen in computer science as a way to obtain an integer d from e/o even when o does not divide e.

The simple case

Let's start by assuming that e is non-negative and o is (strictly) positive. In this case, integer division gives the largest integer d such that d*o ≤ e. In other words, the result of the division of e by o is truncated, i.e. rounded down, however small the remainder might be: 3/5=0 and 5/3=1 with integer division, even though in the latter case we would likely have preferred a value of 2 (think of 2047/1024, for example).

The upside of this choice is that it's trivial to implement other forms of division (that round up, or to the nearest number, for example), by simply adding appropriate correcting factors to the dividend. For example, round-up division is achieved by adding the divisor diminished by a unit to the dividend: integer division (e + o - 1)/o will give you e/o, rounded up: (3+5-1)/5 = 7/5 = 1, and (5 + 3 - 1)/3 = 7/3 = 2.
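
In C, for non-negative dividends and positive divisors, a sketch:

```c
#include <stdint.h>

/* Round-up (ceiling) division: add o - 1 to the dividend before the
 * truncating division. Caveat: e + o - 1 must not overflow the type. */
uint32_t div_round_up(uint32_t e, uint32_t o)
{
    return (e + o - 1) / o;
}

/* div_round_up(3, 5) == 1, div_round_up(5, 3) == 2;
 * similarly, (e + o/2) / o gives round-to-nearest. */
```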

Division by zero

What happens when o is zero? Mathematically, division by zero is not defined (although in some contexts where infinity is considered a valid value, it may give infinity as a result, as long as the dividend is non-zero). In hardware, anything can happen.

There's hardware that flags the error. There's hardware that produces bogus results without any chance of knowing that a division by zero happened. There's hardware that produces consistent results (always zero, or the maximum representable value), flagging or not flagging the situation.

‘Luckily’, most programming languages treat a division by zero as an exception, which by default causes program termination. Of course, this means that to write robust code it's necessary to sprinkle the code with conditionals to check that divisions will complete successfully.

Negative numbers

If the undefined division by zero may not be considered a big issue per se, the situation is much more interesting when either of the operands of the division is a negative number.

First of all, one would be led to think that at least the sign of the result would be well defined: negative if the operands have opposite sign, positive otherwise. But this is not the case for the widespread two's complement representation with modular arithmetic, where the division of two negative numbers can give a negative number: of course, we're talking about the corner case of the largest (in magnitude) negative number, which when divided by -1 returns itself, since its opposite is not representable.

But even when the sign is correct, the result of integer division is not uniquely determined: some implementations round down, so that -7/5 = -2, while others round towards zero, so that -7/5 = -1. Both choices are consistent with positive integer division, but the results are obviously different, which can introduce subtle but annoying bugs when porting code across different languages or hardware.
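
For reference, C (since C99) mandates rounding towards zero; a rounding-down (floor) division can be recovered from it with an explicit correction, as in this sketch:

```c
/* C's / truncates towards zero: -7 / 5 == -1.
 * Floor division rounds down instead: div_floor(-7, 5) == -2.
 * A correction is needed when the remainder is non-zero and the
 * operands have opposite signs. */
int div_floor(int e, int o)
{
    int q = e / o;
    int r = e % o;
    return (r != 0 && ((r < 0) != (o < 0))) ? q - 1 : q;
}
```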

Modulo

The modulo operation is perfectly well defined for positive integers, as the remainder of (integer) division: the quotient d and the remainder r of the (integer) division e/o are (non-negative) integers such that e = o*d + r and 0 ≤ r < o.

Does the same hold true when either e or o are negative? It depends on the convention adopted by the language and/or hardware. While for negative integer division there are ‘only’ two standards, for the modulo operation there are three:

  • a result with the sign of the dividend;
  • a result with the sign of the divisor;
  • a result that is always non-negative.

In the first two cases, this means that, for example, -3 % 5 will have the opposite sign of 3 % -5; hence, if one satisfies the quotient/remainder equation (which depends on whether integer division rounds down or towards zero), the other obviously won't. In the third case, the equation is only satisfied if the division rounds down, but not if it rounds towards zero.
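
The third convention (sometimes called Euclidean modulo) can be sketched in C on top of the truncating operators:

```c
/* Always non-negative modulo: the result lies in [0, |o|). */
int mod_euclid(int e, int o)
{
    int r = e % o;      /* in C, r has the sign of the dividend e */
    return r < 0 ? r + (o < 0 ? -o : o) : r;
}

/* mod_euclid(-3, 5) == 2, mod_euclid(3, -5) == 3, mod_euclid(-3, -5) == 2. */
```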

This could lead someone to think that the best choice would be a rounding-down division with an always non-negative modulo. Too bad that rounding-down division suffers from the problem that -(e/o) ≠ (-e)/o.

Summary

Integer math on a computer is simple only as long as you never have to think about corner cases, which you should if you want to write robust, reliable code. With integer math, this is the minimum of what you should be aware of: