Quantization and overflow

When computing with a fixed- or floating-point digital processor, numbers must be quantized, because only a finite set of values can be represented. We must also guard against overflow, which occurs when a quantity $x$ falls outside the range of values that can be represented.

Quantization

Quantization is the process of representing a number $x$ by another number $\hat{x}$ that is approximately equal to $x$ but can take fewer distinct values. The difference between the two is the quantization error $e=\hat{x}-x$.

For a quantization step $q$, Fig. 1 shows three common quantization characteristics: Rounding, Value truncation and Magnitude truncation.
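The three characteristics can be sketched in Python as follows. This is a minimal sketch and not a definitive implementation: the function names are hypothetical, and since the figure is not reproduced here, the sketch assumes the usual definitions, namely that rounding picks the nearest multiple of $q$ (ties go upward), value truncation rounds toward minus infinity, and magnitude truncation rounds toward zero.

```python
import math

def quantize_round(x, q):
    """Rounding: nearest multiple of q (assumed: ties go upward)."""
    return q * math.floor(x / q + 0.5)

def quantize_value_trunc(x, q):
    """Value truncation (assumed): largest multiple of q not above x."""
    return q * math.floor(x / q)

def quantize_magnitude_trunc(x, q):
    """Magnitude truncation (assumed): drop the fraction, i.e. round toward zero."""
    return q * math.trunc(x / q)

q = 0.25  # quantization step, chosen as a power of two so results are exact
for x in (0.4, -0.3):
    for quantize in (quantize_round, quantize_value_trunc, quantize_magnitude_trunc):
        x_hat = quantize(x, q)
        e = x_hat - x  # quantization error e = x_hat - x
        print(f"{quantize.__name__}({x}) = {x_hat}, e = {e:+.2f}")
```

Under these assumptions, the two truncation variants differ only for negative inputs: for $x=-0.3$ and $q=0.25$, value truncation gives $-0.5$, whereas magnitude truncation (like rounding in this case) gives $-0.25$. The error $e$ lies in $(-q/2, q/2]$ for rounding and in $(-q, 0]$ for value truncation, while for magnitude truncation $|e| < q$ with a sign opposite to that of $x$.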

Note that quantization is a non-linear operation: in general, the quantized sum of two numbers $x$ and $y$ is not equal to the sum of the two individually quantized numbers, that is, $$ \widehat{x+y} \neq \hat{x} + \hat{y} $$
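A small numerical check makes the non-linearity concrete. The sketch below assumes rounding to the nearest multiple of $q$ with $q=1$; the helper `quantize_round` and the chosen values $x=y=0.4$ are illustrative assumptions, not taken from the text.

```python
import math

def quantize_round(x, q):
    """Round x to the nearest multiple of q (assumed: ties go upward)."""
    return q * math.floor(x / q + 0.5)

q = 1.0
x, y = 0.4, 0.4

# Add first, then quantize: 0.8 rounds up to 1.0
sum_then_quantize = quantize_round(x + y, q)

# Quantize each term first, then add: 0.4 rounds down to 0.0, twice
quantize_then_sum = quantize_round(x, q) + quantize_round(y, q)

print(sum_then_quantize, quantize_then_sum)  # prints "1.0 0.0"
```

The two results differ by a full quantization step, because each individual error is discarded before the sum is formed.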

Overflow

Overflow is another form of non-linearity. It occurs when the quantity $x$ falls outside the range of representable numbers, whose boundaries are denoted by $-A$ and $+A$. The relation between $x$ and the value $\hat{x}$ that is actually represented is called the overflow characteristic.

Fig. 2 shows three common overflow characteristics: Saturation, Nulling and Sawtooth overflow.
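The three overflow characteristics can be sketched in Python as follows. As before, this is a sketch under stated assumptions rather than a definitive implementation: the function names are hypothetical, and the handling of the boundaries is an assumption. Here saturation and nulling treat $\pm A$ as representable, while the sawtooth variant wraps into the half-open interval $[-A, A)$, as two's complement arithmetic does.

```python
def overflow_saturate(x, A):
    """Saturation: clip x to the nearest boundary, -A or +A."""
    return max(-A, min(A, x))

def overflow_null(x, A):
    """Nulling: replace any out-of-range value by zero."""
    return x if -A <= x <= A else 0.0

def overflow_sawtooth(x, A):
    """Sawtooth: wrap x around into [-A, A) (assumed boundary convention)."""
    return (x + A) % (2 * A) - A

A = 1.0  # assumed range limit
for x in (0.5, 1.5, -1.25):
    print(x, overflow_saturate(x, A), overflow_null(x, A), overflow_sawtooth(x, A))
```

For in-range inputs such as $x=0.5$ all three functions return $x$ unchanged. For $x=1.5$ and $A=1$, saturation returns $1.0$, nulling returns $0.0$, and the sawtooth characteristic wraps around to $-0.5$.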

Last updated on Jun 13, 2020