One would assume that doing integer math with computers would be easy. After all, integer math is, in some sense, the “simplest” form of math: as Kronecker said:

Die ganzen Zahlen hat der liebe Gott gemacht, alles andere is Menschenwerk

The dear God has made integers, everything else is the work of man

While in practice this is (almost) always the case, introducing the extent (and particularly the limitations) to which integer math is easy (or, in fact, ‘doable’) is the first necessary step to understanding, later in this series of articles, some of the limitations of fixed-point math.

We start from the basics: since we are assuming a binary computer, we
know that *n* bits can represent 2^*n* distinct values. So an 8-bit byte
can represent 256 distinct values, a 16-bit word can represent 65536
distinct values, a 32-bit word can represent 4,294,967,296, and a 64-bit
word a whooping 18,446,744,073,709,551,616, over 18 (short) trillion. Of
course the question now is: *which ones*?

## Representation of unsigned integers

Let's consider a standard 8-bit byte. The most obvious and natural
*interpretation* of a byte (i.e. 8 consecutive bits) is to interpret it
as a (non-negative, or unsigned) integer, just like we would interpret a
sequence of consecutive (decimal) digits. So binary `00000000`

would be 0,
binary `00000001`

would be (decimal) 1, binary `00000010`

would be (decimal)
2 binary `00000011`

would be (decimal) 3 and so on, up to (binary)
`11111111`

which would be (decimal) 255. From 0 to 255 inclusive, that's
exactly the 256 values that can be represented by a byte (read as an
unsigned integer).

Unsigned integers can be trivially promoted to wider words (e.g. from 8-bit byte to 16-bit word, preserving the numerical value) by padding with zeroes.

This is so simple that it's practically boring. Why are we even going
through this? Because things are *not* that simple once you move beyond
unsigned integers. But before we do that, I would like to point out that
things aren't that simple even if we're just sticking to non-negative
integers. In terms of *representation* of the numbers, we're pretty
cozy: *n* bits can represent all non-negative integers from 0 to
2^*n*-1, but what happens when you start doing actual math on them?

## Modulo and saturation

Let's stick to just addition and multiplication at first, which are the
simplest and best defined operations on integers. Of course, the trouble
is that if you are adding or multiplying two numbers between 0 and 255,
the result might be *bigger* than 255. For example, you might need to do
100 + 200, or 128*2, or even just 255+1, and the result is not
representable in an 8-bit byte. In general, if you are operating on
*n*-bits numbers, the result might not be representable in *n* bits.

So what does the computer do when this kind of *overflow* happens? Most
programmers will now chime in and say: well *duh*, it wraps! If you're
doing 255+1, you will just get 0 as a result. If you're doing 128*2,
you'll just get 0. If you're doing 100+200 you'll just get 44.

While this answer is not wrong, it's not right either.

Yes, it's true that the most common central processing units we're used
to nowadays use *modular arithmetic*, so that operations that would
overflow *n*-bits words are simply computed modulo 2^*n* (which is easy
to implement, since it just means discarding higher bits, optionally
using some specific flag to denote that a carry got lost along the way).

However, this is not the only possibility. For example, specialized DSP
(Digital Signal Processing) hardware normally operates with *saturation*
arithmetic: overflowing values are *clamped* to the maximum
representable value. 255+1 gives 255. 128*2 gives 255. 100+200 gives
255.

Programmers used to the standard modular arithmetic can find saturation arithmetic ‘odd’ or ‘irrational’ or ‘misbehaving’. In particular, in saturation arithmetic (algebraic) addition is not associative, and multiplication does not distribute over (algebraic) addition.

Sticking to our 8-bit case, for example, with saturation arithmetic (100 + 200) - 100 results in 255 - 100 = 155, while 100 + (200 - 100) results in 100 + 100 = 200, which is the correct result. Similarly, still with saturation arithmetic, (200*2) - (100*2) results in 255 - 200 = 55, while (200 - 100)*2 results in 100*2 = 200. By contrast, with modular arithmetic, both expressions in each case give the correct result.

So, *when the final result is representable*, modular arithmetic gives
the correct result in the case of a static sequence of operations.
However, when the final result is *not* representable, saturation
arithmetic returns values that are closer to the correct one than
modular arithmetic: 300 is clamped to 255, in contrast to the severely
underestimated 44.

Being as close as possible to the correct results is an extremely important property not just for the final result, but also for intermediate results, particularly in the cases where the sequence of operations is not static, but depends on the magnitude of the values (for example, software implementations of low- or high-pass filters).

In these applications (of which DSP, be it audio, video or image processing, is probably the most important one) both modular and saturation arithmetic might give the wrong result, but the modular result will usually be significantly worse than that obtained by saturation. For example, modular arithmetic might miscompute a frequency of 300Hz as 44Hz instead of 255Hz, and with a threshold of 100Hz this would lead to attenuation of a signal that should have passed unchanged, or conversely. Amplifying an audio signal beyond the representable values could result in silence with modular arithmetic, but it will just produce the loudest possible sound with saturation.

We mentioned that promotion of unsigned values to wider data types is trivial. What about demotion? For example, knowing that your original values are stored as 8-bit bytes and that the final result has to be again stored as an 8-bit byte, a programmer might consider operating with 16-bit (or wider) words to (try and) prevent overflow during computations. However, when the final result has to be demoted again to an 8-bit byte, a choice has to be made, again: should we just discard the higher bits (which is what modular arithmetic does), or return the highest representable value when any higher bits are set (which is what saturation arithmetic does)? Again, this is a choice for which there is no “correct” answer, but only answers that depend on the application.

To conclude, the behavior that programmers used to standard modular arithmetic might find ‘wrong’ is actually preferable in some applications (which is why it is has been supported in hardware in the multimedia and vector extensions (MMX and onwards) of the x86 architecture).

## Thou shalt not overflow

Of course, the real problem in the examples presented in the previous section is that the data type used (e.g. 8-bit unsigned integers) was unable to represent intermediate or final results.

One of the most important things programmers should consider, maybe *the*
most important, when discussing doing math on the computer, is precisely
choosing the correct data type.

For integers, this means choosing a data type that can represent correctly not only the starting values and the final results, but also the intermediate values. If your data fits in 8 bits, then you want to use at least 16 bits. If it fits in 16 bits (but not 8), then you want to use at least 32, and so on.

Having a good understanding of the possible behaviors in case of
overflow is extremely important to write robust code, but the main point
is that *you should not overflow*.

## Relative numbers: welcome to hell

In case you are still of the opinion that integer math is easy, don't
worry. We still haven't gotten into the best part, which is *how to deal
with relative numbers*, or, as the layman would call them, *signed*
integers.

As we mentioned above, the ‘natural’ interpretation of *n* bits is to
read them as natural, non-negative, unsigned integers, ranging from 0 to
2^*n*-1. However, let's be honest here, non-negative integers are pretty
limiting. We would at least like to have the possibility to also
specify *negative* numbers. And here the fun starts.

Although there is no official universal standard for the representation
of relative numbers (signed integers) on computers, there is undoubtedly
a *dominating* convention, which is the one programmers are nowadays
used to: two's complement. However, this is just one of *many* (no less
than four) possible representations:

- sign bit and mantissa;
- ones' complement;
- two's complement;
- offset binary aka biased representation.

### Symmetry, zeroes and self-negatives

One of the issues with the representation of signed integers in binary
computers is that binary words can always represent an *even* number of
values, but a symmetrical amount of positive and negative integers, plus
the value 0, is odd. Hence, when choosing the representation, one has to
choose between either:

- having one (usually negative) non-zero number with no representable opposite, or
- having two representations of the value zero (essentially, positive and negative zero).

Of the four signed number representations enumerated above, the sign bit and ones' complement representations have a signed zero, but each non-zero number has a representable opposite, while two's complement and bias only have one value for zero, but have at least one non-zero number that has no representable opposite. (Offset binary is actually very generic and can have significant asymmetries in the ranges of representable numbers.)

#### Having a negative zero

The biggest issue with having a negative zero is that it violates a commonly held assumption, which is that there is a bijective correspondence between representable numerical values and their representation, since both positive and negative 0 have the same numerical value (0) but have distinct bit patterns.

Where this presents the biggest issue is in the comparison of two words.
When comparing words for equality, we are now posed a conundrum: should
they be compared by their *value*, or should they be compared by their
*representation*? If `a = -0`

, would `a`

satisfy `a == 0`

? Would it
satisfy `a < 0`

? Would it satisfy both? The obvious answer would be that
+0 and -0 should compare equal (and just that), but how do you tell them
apart then? Is it even worth it being able to tell them apart?

And finally, is the symmetry worth the lost of a representable value?
(2^*n* bit patterns, but two of them have the same value, so e.g. with
8-bit bytes we have 256 patterns to represent 255 values instead of the
usual 256.)

#### Having non-symmetric opposites

On the other hand, if we want to keep the bijectivity between value and
representation, we will lose the symmetry of negation. This means, in
particular, that knowing that a number `a`

satisfies `a < 0`

we cannot
deduce that `-a > 0`

, or conversely, depending on whether the value with
no opposite is positive or negative.

Consider for example the case of the standard two's complement
representation in the case of 8-bit bytes: the largest
representable positive value is 127, while the largest (in magnitude)
representable negative value is -128. When computing opposites, all
values between -127 and 127 have their opposite (which is the one we
would expect algebraically), but negating -128 gives (again) -128 which,
while algebraically wrong, is at least *consistent* with modular
arithmetic, where adding -128 and -128 actually gives 0.

### A brief exposition of the representations

Let's now see the representations in some more detail.

#### Sign bit and mantissa representation

The conceptually simplest approach to represent signed integers, given a
fixed number of digits, is to reserve one bit to indicate the sign, and
leave the other *n*-1 bits to indicate the mantissa i.e magnitude i.e.
absolute value of the number. By convention, the sign bit is usually
taken to be the most significant bit, and (again by convention) it is
taken as 0 to indicate a positive number and 1 to indicate a negative
number.

With this representations, two opposite values have the same
representation *except for the most significant bit*. So, for example,
assuming our usual 8-bit byte, 1 would be represented as `00000001`

, while
-1 would be represented as `10000001`

.

In this representation, the highest *positive* value that can be
represented with *n* bits is 2^{*n-1*} - 1, and the lowest (largest in
magnitude) *negative* value that can be represented is its opposite. For
example, with an 8-bit byte the largest *positive* integer is 127, i.e.
`01111111`

, and the largest (in magnitude) *negative* integer is its
opposite -127, i.e. `11111111`

.

As mentioned, one of the undersides of this representation is that it
has both positive and negative zero, respectively represented by the
`00000000`

and `10000000`

bit patterns.

While the sign bit and mantissa representation is conceptually obvious, its hardware implementation is more cumbersome that it might seem at first hand, since operations need to explicitly take the operands' signs into account. Similarly, sign-extension (for example, promoting an 8-bit byte to a 16-bit word preserving the numerical value) needs to ‘clear up’ the sign bit in the smaller-size representation before replicating it as the sign bit of the larger-size representation.

#### Ones' complement representation

A more efficient approach is offered by ones' complement representation,
where negation maps to ones' complement, i.e. bit-flipping: the opposite
of any given number is obtained as the bitwise NOT operation of the
representation of the original value. For example, with 8-bit bytes, the
value 1 is as usual represented as `00000001`

, while -1 is represented as
`11111110`

.

The range of representable numbers is the same as in the sign bit and
mantissa representation, so that, for example, 8-bit bytes range from
-127 (`10000000`

) to 127 (`01111111`

), and we have both positive zero
(`00000000`

) and negative zero (`11111111`

).

(Algebraic) addition in modular arithmetic with this representation is trivial to implement in hardware, with the only caveat that carries and borrows ‘wrap around’.

As in the sign-bit case, it is possible to tell if a number is positive
or negative by looking at the most-significant bit, and 0 indicates a
positive number, while 1 indicates a negative number (whose absolute
value can then be obtained by flipping all the bits). Sign-extending a
value can be done by simply propagating the sign bit of the smaller-size
representation to *all the additional bits* in the larger-size
representation.

#### Two's complement

While ones' complement representation is practical and relatively easy to implement in hardware, it is not the simplest, and it's afflicted by the infamous ‘negative zero’ issue.

Because of this, two's complement representation, which is simpler to implement and has no negative zero, has gained much wider adoption. It also has the benefit of ‘integrating’ rather well with the equally common modular arithmetic.

In two's complement representation, the opposite of an *n*-bit value is
obtained by subtracting it from 2^*n*, or, equivalently, from flipping
the bits and then adding 1, discarding any carries beyond the *n*-th
bit. Using our usual 8-bit bytes as example, 1 will as usual be
`00000001`

, while -1 will be `11111111`

.

The largest positive representable number with *n* bits is still
2^{*n*-1}-1, but the largest (in magnitude) *negative* representable
number is now -2^{*n*-1}, and it's represented by a high-bit set to 1
and all other bits set to 0. For example, with 8-bit bytes the largest
positive number is 127, represented by `01111111`

, whose opposite -127 is
represented by `10000001`

, while the largest (in magnitude) negative
number is -128, represented by `10000000`

.

In two's complement representation, there is no negative zero and the only representation for 0 is given by all bits set to 0. However, as discussed earlier, this leads to a negative value whose opposite is the value itself, since the representation of largest (in magnitude) negative representable number is invariant by two's complement.

As in the other two representations, the most significant bit can be checked to see if a number is positive and negative. As in ones' complement case, sign-extension is done trivially by propagating the sign bit of the smaller-size value to all other bits of the larger-size value.

#### Offset binary

Offset binary (or biased representation) is quite different from the other representations, but it has some very useful properties that have led to its adoption in a number of schemes (most notably the IEEE-754 standard for floating-point representation, where it's used to encode the exponent, and some DSP systems).

Before getting into the technical details of offset binary, we look at a
possible *motivation* for its inception. The attentive reader will have
noticed that all the previously mentioned representations of signed
integers have one interesting property in common: they *violate* the
*natural ordering* of the representations.

Since the most significant bit is taken as the sign bit, and negative
numbers have a most significant bit set to one, natural ordering (by bit
patterns) puts them *after* the positive numbers, whose most significant
bit is set to 0. Additionally, in the sign bit and mantissa
representation, the ordering of negative numbers is reversed with
respect to the natural ordering of their representation. This means that
when comparing numbers it is important to know if they are signed or
unsigned (and if signed, which representation) to get the ordering
right. The biased representation is one way (and probably the most
straightforward way) to circumvent this.

The basic idea in biased representation or offset binary is to ‘shift’ the numerical value of all representations by a given amount (the bias or offset), so that the smallest natural representation (all bits 0) actually evaluates to the smallest representable number, and the largest natural representation (all bits 1) evaluates to the largest representable number.

The bias is the value that is added to the (representable) value to
obtain the representation, and subtracted from the representation to
obtain the represented value. The minimum representable number is then
the opposite of the bias. Of course, the range of representable numbers
doesn't change: if your data type can only represent 256 values, you can
only choose *which* 256 values, as long as they are consecutive
integers.

The bias in an offset binary representation can be chosen arbitrarily,
but there is a ‘natural’ choice for *n*-bit words, which is 2^{*n*-1}:
halfway through the natural representation. For example, with 8-bit
bytes (256 values) the natural choice for the bias is 128, leading to a
representable range of integers from -128 to 127, which looks distinctly
similar to the one that can be expressed in two's
complement representation.

In fact, the 2^{*n*-1} bias leads to a representation which is
*equivalent* to the two's complement representation, except for a
flipped sign bit, solving the famous signed versus unsigned comparison
issue mentioned at the beginning of this subsection.

As an example, consider the usual 8-bit bytes with a bias of 128: then,
the numerical values 1, 0 and -1 would be represented by the ‘natural’
representation of the values 129, 128 and 127 respectively, i.e.
`10000001`

, `10000000`

and `01111111`

: flipping the most significant bits, we
get `00000001`

, `00000000`

and `11111111`

which are the two's complement
representation of 1, 0 and -1.

Of course, the ‘natural’ bias is not the only option: it is possible to have arbitrary offsets, which makes offset binary extremely useful in applications where the range of possible values is strongly asymmetrical around zero, or where it is far from zero. Of course, such arbitrary biases are rarely supported in hardware, so operation on offset binary usually requires software implementations of even the most common operations, with a consequent performance hit. Still, assuming the hardware uses modular arithmetic, offset binary is at least trivial to implement for the basic operations.

One situation in which offset binary doesn't play particularly well is
that of sign-extension, which was trivial in ones' and two's complement
represnetations. The biggest issue in the case of offset binary is,
obviously, that the offsets in the smaller and larger data types are
likely going to be different, although usually not *arbitrarily*
different (biases are often related to the size of the data type).

At least in the case of the ‘natural’ bias (in both the smaller and larger data types), sign extension can be implemented straightforwardly by going through the two's complement equivalent representation: flip the most significant bit of the smaller data type, propagate it to all the remaining bits of the larger data type, and then flip the most significant bit of the larger data type. (In other words: convert to two's complement, sign extend that, convert back to offset binary with the ‘natural’ bias.)

## What does a bit pattern mean?

We're now nearing the end of our discussion on integer math on the
computers. Before getting into the messy details of the first common
non-integer operation (division), I would like to ask the following
question: what do you get if you do `10100101`

+ `01111111`

?

## Divide and despair

To conclude our exposition of the joys of integer math on the computers, we now discuss the beauty of integer division and the related modulus operation.

Since division of the integer *e* by the integer *o* only gives an
integer (mathematically) if *e* is a multiple of *o*, the concept of
‘integer division’ has arised in computer science as a way to obtain an
integer *d* from *e/o* even when *o* does not divide *e*.

### The simple case

Let's start by assuming that *e* is non-negative and *o* is (strictly)
positive. In this case, integer division gives the largest integer *d*
such that *d*o ≤ e*. In other words, the result of the division of *e*
by *o* is *truncated*, or ‘approximated by defect’, however small the
remainder might be: 3/5=0 and 5/3=1 with integer division, even though
in the latter case we would likely have preferred a value of 2 (think of
2047/1024, for example).

The upside of this choice is that it's trivial to implement other forms
of division (that round up, or to the nearest number, for example), by
simply adding appropriate correcting factors to the dividend. For
example, round-up division is achieved by adding the divisor diminished
by a unit to the divident: integer divisoin *(e + o - 1)/o* will give
you *e/o*, rounded *up*: (3+5-1)/5 = 7/5 = 1, and (5 + 3 - 1)/3 = 7/3 =
2.

### Division by zero

What happens when *o* is zero? Mathematically, division by zero is not
defined (although in some context where infinity is considered a valid
value, it may give infinity as a result —as long as the dividend is
non-zero). In hardware, anything can happen.

There's hardware that flags the error. There's hardware that produces bogus results without any chance of knowing that a division by zero happened. There's hardware that produces consistent results (always zero, or the maximum representable value), flagging or not flagging the situation.

‘Luckily’, most programming languages always treat a division by zero as an exception, which by default causes a program termination. Of course, this means that to write robust code it's necessary to sprinkle the code with conditionals to check that divisions will successfully complete.

### Negative numbers

If the undefined division by zero may not be considered a big issue per
se, the situation is *much* more interesting when either of the operands
of the division is a negative number.

First of all, one would be led to think that at least the sign of the result would be well defined: negative if the operands have opposite sign, positive otherwise. But this is not the case for the widespread two's complement representation with modular arithmetic, where the division of two negative numbers can give a negative number: of course, we're talking about the corner case of the largest (in magnitude) negative number, which when divided by -1 returns itself, since its opposite is not representable.

But even when the sign is correct, the result of integer division is not uniquely determined: some implementations round down, so that -7/5 = -2, while others round towards zero, so that -7/5 = -1: both the choices are consistent with the positive integer division, but the results are obviously different, which can introduce subtle but annoying bugs when porting code across different languages or hardware.

### Modulo

The modulo operation is perfectly well defined for positive integers,
as the reminder of (integer) division: the quotient *d* and the reminder
*r* of (integer) division *e/o* are (non-negative) integers such that
*e = o*d + r* and *r < o*.

Does the same hold true when either *e* or *o* are negative? It depends
on the convention adopted by the language and/or hardware. While for
negative integer division there are ‘only’ two standards, for the modulo
operation there are *three*:

- a result with the sign of the dividend;
- a result with the sign of the divisor;
- a result that is always non-negative.

In the first two cases, what it means is that, for example, -3 % 5 will
have the opposite sign of 3 % -5; hence, if one would satisfy the
quotient/

This could lead someone to think that the best choice would be a
rounding-down division with an always non-negative modulo. Too bad that
rounding-down division suffers from the problem that *-(e/o) ≠ (-e)/o*.

## Summary

Integer math on a computer is simple only as far as you never think
about dealing with corner cases, which you *should* if you want to write
robust, reliable code. With integer math, this is the minimum of what
you should be aware of: