| # Building a quantization paradigm from first principles |
| |
| **TLDR:** If you prefer example code over theory, look at |
| [doc/quantization_example.cc](quantization_example.cc). |
| |
| ## Overview |
| |
gemmlowp allows performing calculations on matrices of uint8 values, but these
matrices are only useful insofar as they somehow approximate matrices of real
numbers. By a _quantization paradigm_ we mean a correspondence between matrices
of quantized 8bit values and matrices of real numbers. The choice of a
quantization paradigm affects the calculations that gemmlowp itself needs to
perform; specifically, it affects how one goes from internal 32bit accumulators
to final 8bit outputs.
| |
The part of gemmlowp transforming internal 32bit accumulators into final
8bit outputs is the "output pipeline" described in [output.md](output.md).
| |
| gemmlowp's `GemmWithOutputPipeline` entry point allows specifying an arbitrary |
| output pipeline, allowing the user to implement their own preferred quantized |
| arithmetic paradigm. |
| |
| In the present document, our purpose is to show how, reasoning from first |
| principles and some domain-specific knowledge of neural networks, we can arrive |
| naturally at some specific quantization paradigm, and how that can be |
| implemented using a specific output pipeline. |
| |
We also aim to show how this differs from the older, legacy quantization
paradigm implemented by gemmlowp's legacy interfaces, and why the change to the
newer quantization paradigm described in this document was useful for some
applications of gemmlowp.
| |
| ## Quantization as an affine map. |
| |
| In order for arithmetic on real values to map directly to arithmetic on |
| quantized uint8 values, the mapping between real and quantized uint8 values must |
| be affine, which means it must be of the form |
| |
| ``` |
| real_value = A * quantized_value + B (1) |
| ``` |
| |
| for some constants A, B, or equivalently, of the form |
| |
| ``` |
| real_value = C * (quantized_value + D) (2) |
| ``` |
| |
for some constants C, D. Indeed, anything other than such an affine map would
mean that the results of the quantized calculations no longer readily provide
an approximation to the results of the real-numbers calculation.
| |
| ## Domain-specific constraint: the real value 0 must be exactly representable. |
| |
Here a domain-specific constraint from neural networks appears: for some neural
network layers, it is very useful for optimized implementations that the
real value 0 be exactly representable.
| |
| For instance, in a Convolutional or Pooling layer with padding, it is useful to |
| be able to implement the padding by zero-padding the input array, so that |
| optimized loops do not need to become more complex to avoid overrunning the |
| array bounds. |
| |
| In order for such zero-padding to be feasible in a quantized implementation of |
| such layers, it is important that the real value '0' be exactly representable in |
| quantized form, i.e. that it correspond exactly to some quantized value, which |
| we call the _zero-point_. |
| |
| Indeed, if '0' were not exactly representable, then we would have to use some |
| quantized value for padding, that does not exactly correspond to the real value |
'0'. That would typically introduce inaccuracy in the result. In fact, always
using the same such value would be worse: it would introduce _bias_ in the
result.
| |
| ## The final form of the quantization equation |
| |
Now let us phrase what this constraint — that the real value 0 be exactly
representable — means in each of the two quantization equations, (1) and (2).
| |
| In equation (1), plugging `real_value = 0` and `quantized_value = zero_point`, |
| we get: |
| |
| ``` |
| 0 = A * zero_point + B |
| ``` |
| |
| equivalently: |
| |
| ``` |
| zero_point = -B / A |
| ``` |
| |
| We are thus left with a rather awkward constraint: the real number `-B / A` must |
| somehow be guaranteed to be exactly integral, so that the special uint8 value |
| `zero_point` can be exactly equal to it. Quite awkward! |
| |
| Now let us look at equation (2). Plugging `real_value = 0` and |
| `quantized_value = zero_point`, we get: |
| |
| ``` |
| 0 = C * (zero_point + D) |
| ``` |
| |
| Conveniently, the constant `C` plays no role anymore, so this equation |
| simplifies to: |
| |
| ``` |
| 0 = zero_point + D |
| ``` |
| |
| In other words, `D = -zero_point`. This suggests rewriting the quantization |
| equation (2) into the following form (3), which will be the final form that we |
| will consistently use: |
| |
| ``` |
| real_value = scale * (quantized_value - zero_point) (3) |
| ``` |
| |
| To go from (2) to (3), we merely renamed `C` to `scale` and `D` to |
| `-zero_point`. |
| |
| With this quantization equation (3), the condition that 0 be exactly |
| representable is vacuously satisfied: `zero_point` is by definition one of the |
| possible `quantized_value`'s, and equation (3) maps it to a `real_value` of |
| exactly 0. |
| |
Note that the final quantization equation (3) depends on two constants, one
integral, the other an arbitrary positive real number:
| |
* `zero_point` is integral; more specifically, it is one of the possible
  quantized values (i.e. typically a uint8 value).
| * `scale` is a positive real number. Thus at this stage we have not yet shown |
| how to eliminate all usage of floating-point arithmetic. That will come |
| below. |
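
To make equation (3) concrete, here is a minimal sketch of quantizing and
dequantizing a single value under this paradigm. The `QuantizationParams`
struct and the `Quantize`/`Dequantize` helpers are illustrative names for this
document, not part of gemmlowp's API:

```
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative names for this document, not part of gemmlowp's API.
struct QuantizationParams {
  float scale;
  std::uint8_t zero_point;
};

// Equation (3) solved for quantized_value. An arbitrary real_value generally
// falls between two representable quantized values, so we must round, and
// out-of-range values must be clamped to the representable uint8 range.
std::uint8_t Quantize(const QuantizationParams& params, float real_value) {
  const float transformed = params.zero_point + real_value / params.scale;
  const float clamped = std::min(255.f, std::max(0.f, transformed));
  return static_cast<std::uint8_t>(std::round(clamped));
}

// Equation (3) applied directly.
float Dequantize(const QuantizationParams& params,
                 std::uint8_t quantized_value) {
  return params.scale * (quantized_value - params.zero_point);
}
```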
| |
| ## Quantizing a matrix multiplication |
| |
Now that we know — equation (3) — how real numbers are to correspond
to quantized values (typically uint8), we turn to applying this knowledge to
rewriting a multiplication of matrices of real numbers as an equivalent
multiplication of matrices of quantized values.
| |
| Say that we have two matrices of real values `lhs_real_matrix`, |
| `rhs_real_matrix`. Each entry of their product is the sum (accumulation) of many |
| products of individual matrix entries, say `lhs_real_value * rhs_real_value`. |
| |
| Now suppose that we have already quantized these two matrices according to the |
| above equation (3), with some already-known quantization parameters `lhs_scale`, |
| `rhs_scale`, `lhs_zero_point`, `rhs_zero_point`, so that their matrix entries |
| are quantized as |
| |
| ``` |
| lhs_real_value[i] = lhs_scale * (lhs_quantized_value[i] - lhs_zero_point) |
| rhs_real_value[i] = rhs_scale * (rhs_quantized_value[i] - rhs_zero_point) |
| ``` |
| |
| We then rewrite the matrix product accumulator accordingly: |
| |
| ``` |
| result_real_value |
| = Sum_over_i(lhs_real_value[i] * rhs_real_value[i]) |
| = Sum_over_i( |
| lhs_scale * (lhs_quantized_value[i] - lhs_zero_point) * |
| rhs_scale * (rhs_quantized_value[i] - rhs_zero_point) |
| ) |
| = lhs_scale * rhs_scale * Sum_over_i( |
| (lhs_quantized_value[i] - lhs_zero_point) * |
| (rhs_quantized_value[i] - rhs_zero_point) |
| ) (4) |
| ``` |
| |
| Now our goal is to represent this result itself as a quantized matrix, i.e. |
| still according to equation (3), for some pre-established quantization |
| parameters `result_scale` and `result_zero_point`, as |
| |
| ``` |
| result_real_value = result_scale * |
| (result_quantized_value - result_zero_point) |
| ``` |
| |
| Here we need to keep in mind that our goal is to specify what the quantized |
| matrix multiplication should do, i.e. how to compute `result_quantized_value`. |
| The last equation above is equivalent to |
| |
| ``` |
| result_quantized_value = result_zero_point + |
| result_real_value / result_scale |
| ``` |
| |
Now we can use equation (4) above to plug in the expression of
`result_real_value` in terms of the quantized operands, and we obtain:
| |
| ``` |
| result_quantized_value = result_zero_point + |
| (lhs_scale * rhs_scale / result_scale) * |
| Sum_over_i( |
| (lhs_quantized_value[i] - lhs_zero_point) * |
| (rhs_quantized_value[i] - rhs_zero_point) |
| ) (5) |
| ``` |
| |
| Equation (5) is the conclusion of this general discussion of how to specify what |
| "quantized matrix multiplication" should actually compute, in order to be able |
| to replace real matrix multiplications. |
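
As a reference point, here is a naive scalar sketch of equation (5) for a
single result entry. The function name and signature are made up for this
document, and it still uses a floating-point multiplier; the next section
discusses how a real implementation avoids the floating-point arithmetic.

```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Naive evaluation of equation (5) for one result entry. Illustrative only.
std::uint8_t QuantizedDotProduct(
    const std::vector<std::uint8_t>& lhs_quantized_values,
    const std::vector<std::uint8_t>& rhs_quantized_values,
    std::int32_t lhs_zero_point, std::int32_t rhs_zero_point,
    float lhs_scale, float rhs_scale,
    std::int32_t result_zero_point, float result_scale) {
  // The integer-only accumulation loop, producing an int32 accumulator.
  std::int32_t accumulator = 0;
  for (std::size_t i = 0; i < lhs_quantized_values.size(); i++) {
    accumulator += (lhs_quantized_values[i] - lhs_zero_point) *
                   (rhs_quantized_values[i] - rhs_zero_point);
  }
  // Scale by the combined multiplier, add the result zero_point, and
  // clamp to the representable uint8 range.
  const float multiplier = lhs_scale * rhs_scale / result_scale;
  const float result =
      std::round(result_zero_point + multiplier * accumulator);
  return static_cast<std::uint8_t>(std::min(255.f, std::max(0.f, result)));
}
```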
| |
| ## Implementation of quantized matrix multiplication |
| |
| Having obtained the mathematical form (5) of quantized matrix multiplication, we |
| now turn to its actual implementation. |
| |
The innermost part of (5),
| |
| ``` |
| int32_accumulator = |
| Sum_over_i( |
| (lhs_quantized_value[i] - lhs_zero_point) * |
| (rhs_quantized_value[i] - rhs_zero_point) |
| ) |
| ``` |
| |
| is the "kernel" accumulation loop. It is where the bulk of the computational |
| cost goes. Luckily, it only involves integers: the quantized operands matrix |
| entries, and their `zero_point` quantization parameters. Typically, all of these |
| values are uint8. Typically, the above differences of uint8 values would be |
| represented as signed int16; their products as signed int32. |
| |
| It is out of scope of the present doc to discuss how to avoid the overhead of |
| having to subtract these `zero_point` constants in this inner loop; refer to |
| [this section of |
| low-precision.md](low-precision.md#efficient-handling-of-offsets) for that. The |
| gist of it is that a mathematical trick allows us to take the handling of these |
| `zero_point` constants out of this accumulation loop, so that it simplifies to |
| |
| ``` |
| int32_accumulator = |
| Sum_over_i( |
| lhs_quantized_value[i] * |
| rhs_quantized_value[i] |
| ) (6) |
| ``` |
| |
In any case, the result is an `int32_accumulator` that we now plug back into
the rest of (5):
| |
| ``` |
| result_quantized_value = result_zero_point + |
| (lhs_scale * rhs_scale / result_scale) * int32_accumulator (7) |
| ``` |
| |
The difficulty here is of course that `(lhs_scale * rhs_scale / result_scale)`
is a positive real number, not an integer in general. It is a constant, though.
So what we have to implement here is the (approximate) scaling of an int32
value by some arbitrary positive constant multiplier.
| |
| Moreover, it is safe to assume that this positive constant multiplier is smaller |
| than one — each of the `scale` values here is typically smaller than one, |
| as we are typically mapping the `[0..255]` quantized uint8 value range to an |
| interval of real values that is much narrower than that, typically within |
| `[-10,10]` in most neural networks. For example, a neural network using Relu6 |
| activation functions will typically have real activation values in the interval |
| [0,6]. |
| |
So how do we implement the multiplication of an int32 value by a positive real
constant that is smaller than one? Typically, by multiplying by a fixed-point
constant multiplier in the normalized interval `[1/2,1)`, and then
right-shifting the result by a constant exponent to achieve the overall
desired multiplier.
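
Concretely, the constant real multiplier can be decomposed offline into an
int32 fixed-point multiplier and a right-shift amount, which is then applied
at runtime using only integer arithmetic. The following is a sketch close to
what [doc/quantization_example.cc](quantization_example.cc) does; the function
names are illustrative:

```
#include <cassert>
#include <cmath>
#include <cstdint>

// Decomposes a real multiplier in (0, 1) as
//   real_multiplier ~= (quantized_multiplier / 2^31) * 2^(-right_shift),
// with quantized_multiplier a Q31 fixed-point value in [2^30, 2^31),
// i.e. representing a fraction in [1/2, 1).
void QuantizeMultiplierSmallerThanOne(double real_multiplier,
                                      std::int32_t* quantized_multiplier,
                                      int* right_shift) {
  assert(real_multiplier > 0. && real_multiplier < 1.);
  int exponent;
  // frexp writes real_multiplier = q * 2^exponent with q in [1/2, 1).
  const double q = std::frexp(real_multiplier, &exponent);
  *right_shift = -exponent;
  std::int64_t q_fixed =
      static_cast<std::int64_t>(std::round(q * (1ll << 31)));
  if (q_fixed == (1ll << 31)) {
    // q rounded up to exactly 1.0: halve it and compensate in the shift.
    q_fixed /= 2;
    --*right_shift;
  }
  *quantized_multiplier = static_cast<std::int32_t>(q_fixed);
}

// Applies the decomposed multiplier to an int32 value: 64-bit product, then
// one rounding right shift removing the 31 fractional bits of the Q31
// multiplier together with the extra right_shift.
std::int32_t MultiplyByQuantizedMultiplierSmallerThanOne(
    std::int32_t x, std::int32_t quantized_multiplier, int right_shift) {
  const std::int64_t product =
      static_cast<std::int64_t>(x) * quantized_multiplier;
  const int total_shift = 31 + right_shift;
  const std::int64_t rounding = std::int64_t{1} << (total_shift - 1);
  return static_cast<std::int32_t>((product + rounding) >> total_shift);
}
```

gemmlowp's own code builds the same runtime operation out of the
`SaturatingRoundingDoublingHighMul` and `RoundingDivideByPOT` primitives in
[fixedpoint/fixedpoint.h](../fixedpoint/fixedpoint.h).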
| |
| At this point we have obtained the int32 value of the product |
| |
| ``` |
| (lhs_scale * rhs_scale / result_scale) * int32_accumulator |
| ``` |
| |
| Looking at (7), it only remains to add to it the integral value |
| `result_zero_point`, and we are done. |
| |
| ## How this is implemented in gemmlowp |
| |
| The different parts of gemmlowp implementing aspects of the above discussion |
| are: |
| |
| * The packing stage (see [packing.md](packing.md)) implements the special |
| mathematical trick to handle `lhs_offset`, `rhs_offset` that we alluded to |
| above, see [this section of |
| low-precision.md](low-precision.md#efficient-handling-of-offsets) for |
  details. Thanks to this, the rest of the calculation can proceed as if
  `lhs_offset`, `rhs_offset` were 0.
| |
| * The compute/kernel stage (see [kernel.md](kernel.md)) performs the core |
| accumulation loop producing the `int32_accumulator`, see equation (6) above. |
| |
* The unpacking stage feeds into the output pipeline (see
  [output.md](output.md)), which implements the rest of the evaluation of the
  above equation (5), as discussed in the previous section.
| |
| Now, the point of gemmlowp's flexible output-pipelines mechanism (see |
| [output.md](output.md)) is to support different quantization paradigms, so we |
| now have to specify which particular flavor of output pipeline corresponds to |
| the particular quantization paradigm that we detailed above in this document. |
| |
| The specific output pipeline stage implementing the present quantization |
| paradigm, i.e. implementing the precise computation detailed in the previous |
| section (equation (5)), is |
| `OutputStageQuantizeDownInt32ByFixedPoint`. |
| |
| Please refer to the comment explaining it in |
| [public/output_stages.h](../public/output_stages.h). |
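
Schematically, and assuming the quantization parameters and matrix storage
have already been set up, invoking gemmlowp with this output pipeline looks
roughly as follows; this is a sketch, see
[doc/quantization_example.cc](quantization_example.cc) for the complete,
compiling version:

```
// Build the output pipeline implementing equation (5): fixed-point
// multiply-and-shift, add the result zero_point, then saturating cast
// down to uint8.
gemmlowp::OutputStageQuantizeDownInt32ByFixedPoint quantize_down_stage;
quantize_down_stage.result_offset_after_shift = result_zero_point;
quantize_down_stage.result_fixedpoint_multiplier = quantized_multiplier;
quantize_down_stage.result_shift = right_shift;
gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
const auto output_pipeline =
    std::make_tuple(quantize_down_stage, saturating_cast_stage);

// Note the minus signs: gemmlowp's offsets are added to the quantized
// values, whereas equation (3) subtracts the zero_points.
gemmlowp::GemmContext gemm_context;
gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                 gemmlowp::DefaultL8R8BitDepthParams>(
    &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix, &uint8_result_matrix,
    -lhs_zero_point, -rhs_zero_point, output_pipeline);
```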
| |
| ## How this differs from the older legacy gemmlowp quantization paradigm |
| |
| The difference between the older legacy quantization paradigm described in |
| [low-precision.md](low-precision.md) and the newer one described in this |
| document boils down to the difference between the legacy output stage |
| implementing it, `OutputStageQuantizeDownInt32ToUint8Scale`, and the new output |
| stage implementing the new paradigm, |
| `OutputStageQuantizeDownInt32ByFixedPoint`. |
| |
| Please refer to the comments in |
| [public/output_stages.h](../public/output_stages.h) for details about these two |
| output stages and how they differ. |
| |
| Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale` are: |
| |
1. The int32 accumulators (inputs to the output stage) undergo a plain int32
   multiplication with an int32 multiplier, which may overflow. By contrast, in
   the newer `OutputStageQuantizeDownInt32ByFixedPoint`, this integer
   multiplication becomes a fixed-point multiplication and cannot overflow.
| |
| * In practice, to limit the risk of overflow, this pushes users to choose |
| smaller values for this integer multiplier, which means limited |
| multiplicative accuracy, which may cause multiplicative bias depending |
| on how it is used. |
| |
2. Note how, between these two output stages, the order of multiplying by the
   multiplier and adding the `result_offset` is swapped (sketched below). The
   legacy stage reflects a quantization equation of the form (1) above, as
   opposed to the form (2)/(3) that the new quantization paradigm uses. As a
   result, with the legacy stage it is essentially impossible to guarantee that
   0 is an exactly-representable value, which as discussed above is an issue at
   least in some convolutional neural network applications.
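
Schematically, the two stages compute roughly the following, where
`RoundingRightShift` and `FixedPointMultiply` are informal shorthand and the
comments in [public/output_stages.h](../public/output_stages.h) remain the
authoritative reference:

```
Legacy stage, OutputStageQuantizeDownInt32ToUint8Scale:
    result = RoundingRightShift(
                 (input + result_offset) * result_mult_int,
                 result_shift)

New stage, OutputStageQuantizeDownInt32ByFixedPoint:
    result = RoundingRightShift(
                 FixedPointMultiply(input, result_fixedpoint_multiplier),
                 result_shift)
             + result_offset_after_shift
```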
| |
| ## Example code illustrating the new quantization paradigm |
| |
Example code showing how to perform a quantized matrix multiplication in the
quantization paradigm discussed here is in
[doc/quantization_example.cc](quantization_example.cc).