Quantization Basics

Hugging Face Docs

Quantization

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).

Theory

Common lower precision data types

| data type | accumulation data type |
| --------- | ---------------------- |
| float16   | float16                |
| bfloat16  | float32                |
| int16     | int32                  |
| int8      | int32                  |

The accumulation data type specifies the type of the result of accumulating (adding, multiplying, etc.) values of the given data type. For example, products of int8 values are summed into an int32 accumulator so that intermediate results do not overflow.
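
A minimal NumPy illustration (my own example, not from the docs) of why int8 values are accumulated in int32:

```python
import numpy as np

# Dot product of two int8 vectors: each product (100 * 100 = 10000) already
# exceeds the int8 range [-128, 127], so accumulation happens in int32.
a = np.array([100, 100, 100, 100], dtype=np.int8)
b = np.array([100, 100, 100, 100], dtype=np.int8)

acc = np.dot(a.astype(np.int32), b.astype(np.int32))
print(acc)  # 40000 -- would overflow an int8 or even an int16 accumulator
```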

Quantization

The two most common quantization cases are float32 → float16 and float32 → int8.

Quantization to float16

Quantization from float32 to float16 is straightforward, since both data types follow the same representation scheme. Questions to ask before quantizing an operation to float16: does the operation have a float16 implementation, does the hardware support float16, and is the operation numerically sensitive to the lower precision?
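
As a small illustration (my own, using NumPy), the cast itself is a plain conversion, but it loses precision and dynamic range:

```python
import numpy as np

x = np.array([0.1, 3.14159265, 65504.0, 1e-8], dtype=np.float32)
x_fp16 = x.astype(np.float16)            # direct cast: same representation scheme, fewer bits
print(x_fp16)                            # 65504 is the float16 maximum; 1e-8 underflows to 0
print(x_fp16.astype(np.float32) - x)     # rounding error introduced by the cast
```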

Quantization to int8 (affine quantization scheme)

Only 256 values can be represented in int8, while float32 can represent a very wide range of values.

The idea is to find the best way to project our range [a, b] of float32 values to the int8 space.

Consider a float x in [a, b]; the affine quantization scheme writes it as

\[x = S \times (x_q - Z)\]

where x_q is the quantized int8 value associated with x, the scale S is a positive float32, and the zero-point Z is the int8 value corresponding to 0 in the float32 space.

Thus the quantized value can be computed as follows:

\[x_q = \mathrm{round}(x/S + Z)\]

And float32 values outside of the [a, b] range are clipped to the closest representable value:

\[x_q = \mathrm{clip}(\mathrm{round}(x/S + Z), \mathrm{round}(a/S + Z), \mathrm{round}(b/S + Z))\]
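
A minimal NumPy sketch of these formulas (the helper names quantize_affine and dequantize are mine, not from any library):

```python
import numpy as np

def quantize_affine(x, a, b, qmin=-128, qmax=127):
    """Quantize float32 values assumed to lie in [a, b] to int8."""
    S = (b - a) / (qmax - qmin)              # scale: float32 width of one int8 step
    Z = round(qmin - a / S)                  # zero-point: int8 value that maps back to 0.0
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)
    return x_q, S, Z

def dequantize(x_q, S, Z):
    return S * (x_q.astype(np.float32) - Z)  # x = S * (x_q - Z)

x = np.array([-1.0, -0.3, 0.0, 0.7, 2.0], dtype=np.float32)
x_q, S, Z = quantize_affine(x, a=-1.0, b=2.0)
print(x_q)                     # int8 codes
print(dequantize(x_q, S, Z))   # close to the original values, up to rounding error
```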

Symmetric and Affine Quantization Schemes

The scheme above, where the zero-point Z can be any int8 value, is called the affine (or asymmetric) quantization scheme. In the symmetric scheme, Z is fixed to 0 and the float32 range is taken symmetric around zero, typically mapping to [-127, 127].

Symmetric quantization can also give a speedup at compute time, since the terms involving the zero-point drop out of the integer matrix multiplication.
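
A corresponding sketch of the symmetric scheme, where Z = 0 and the range is taken from the largest absolute value (again an illustrative helper, not a library API):

```python
import numpy as np

def quantize_symmetric(x):
    """Symmetric int8 quantization: zero-point fixed to 0, range [-127, 127]."""
    S = np.abs(x).max() / 127.0
    x_q = np.clip(np.round(x / S), -127, 127).astype(np.int8)
    return x_q, S

w = np.array([-0.8, -0.1, 0.0, 0.4, 1.2], dtype=np.float32)
w_q, S = quantize_symmetric(w)
print(w_q)       # int8 codes, with 0.0 mapped exactly to 0
print(S * w_q)   # dequantization is just S * x_q, since Z = 0
```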

Calibration

Calibration is the step during quantization where the float32 ranges are computed.

For weights it is quite easy, since the actual range of values is known at quantization time.

But it is less clear for activations, and different approaches exist:

Calibration techniques

- Min-max: the computed range is [min observed value, max observed value]; this works well for weights.
- Moving average min-max: the computed range is a moving average of the [min, max] observed over each calibration batch; this works better for activations.
- Histogram: record a histogram of the observed values together with the min and max, then pick the range according to a criterion such as entropy, mean squared error, or a chosen percentile of the observed values.
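
A minimal sketch of the min-max technique over a few batches of activations (the function name is mine):

```python
import numpy as np

def minmax_calibrate(batches):
    """Track the smallest and largest activation values seen during calibration."""
    lo, hi = np.inf, -np.inf
    for batch in batches:
        lo = min(lo, float(batch.min()))
        hi = max(hi, float(batch.max()))
    return lo, hi

# Pretend these are activations recorded while running a calibration dataset.
batches = [np.random.randn(32, 128).astype(np.float32) for _ in range(10)]
a, b = minmax_calibrate(batches)
print(a, b)   # the [a, b] range then feeds the affine quantization scheme above
```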

How do machines represent numbers?

Representation of real numbers

For a real number x, floating-point formats store an approximation of the form

\[x = \mathrm{sign} \times \mathrm{mantissa} \times 2^{\mathrm{exponent}}\]

The number of bits given to the exponent and the mantissa determines the dynamic range and the precision of the format: float32 uses 8 exponent bits and 23 mantissa bits, float16 uses 5 and 10, and bfloat16 uses 8 and 7, which is why bfloat16 keeps the float32 range at a lower precision.
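
For example, the float32 bit layout (1 sign bit, 8 exponent bits, 23 mantissa bits) can be inspected directly; this snippet is purely illustrative:

```python
import struct

# Reinterpret the 4 bytes of a float32 as an unsigned integer and split the fields.
bits = struct.unpack(">I", struct.pack(">f", 3.14))[0]
sign     = bits >> 31
exponent = (bits >> 23) & 0xFF
mantissa = bits & 0x7FFFFF

# For normal numbers the value is (-1)**sign * (1 + mantissa / 2**23) * 2**(exponent - 127).
print(sign, exponent - 127, mantissa)
print((-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127))  # ~3.14
```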

Going Further

Lei Mao’s blog about Quantization for Neural Networks.