| # The low-precision paradigm in gemmlowp, and how it's implemented |
| |
| ## Introduction |
| |
| "Low-precision" means that the input and output matrix entries are integers on |
| at most 8 bits. The scalar type is uint8_t. |
| |
This isn't the same as just doing plain matrix arithmetic over uint8_t, because
that would overflow: even a single product of two uint8 values, e.g. 200 * 200 =
40000, already exceeds the uint8 range. To avoid overflow, we internally
accumulate results on more than 8 bits, and at the end we keep only some
significant 8 bits. This relies on the caller providing suitable
offset/multiplier/shift parameters, which effectively govern how we extract 8
significant bits from our more-than-8-bit temporary accumulators.
| |
| ## Low-precision paradigms |
| |
| gemmlowp is flexible enough to support multiple low-precision paradigms, i.e. |
multiple ways of attaching a meaning to 8bit values so that a computation can
rely on an 8bit GEMM provided by gemmlowp.
| |
### The current flexible design with arbitrary "output pipelines"
| |
| See [output.md](output.md) for more details about output pipelines. This is a |
| mechanism by which gemmlowp becomes generic enough to support multiple 8bit |
| computation paradigms, by allowing the user to set up a chain of transformations |
| to be performed on internal 32bit accumulators to obtain the final outputs. |
| |
The public entry point in [public/gemmlowp.h](../public/gemmlowp.h) that allows
setting up an arbitrary output pipeline is `GemmWithOutputPipeline`.
| |
| Refer to [quantization.md](quantization.md) for details of how one gets from |
| first principles to the actual output pipelines to assemble for successful |
| real-world quantized calculations. |
| |
| For the scope of the present document, it suffices to say that quantized matrix |
| multiplication takes the following parameters: |
| |
| - The lhs matrix of uint8 quantized values. |
| - The rhs matrix of uint8 quantized values. |
- An int32 lhs_offset that will be added to each entry of the lhs matrix.
- An int32 rhs_offset that will be added to each entry of the rhs matrix.
- An output pipeline that will process int32 accumulators into final outputs.
| |
| The overall computation goes through the following steps: |
| |
| 1. Cast lhs entries from uint8 to int32 and add lhs_offset to each of them. |
| 2. Cast rhs entries from uint8 to int32 and add rhs_offset to each of them. |
| 3. Compute the int32 matrix product of the resulting lhs times rhs. |
| 4. Apply the output pipeline on these int32 accumulators, to obtain the final |
| outputs. |
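
For concreteness, here is a minimal, unoptimized sketch of steps 1-3 in C++
(the function name and the row-major layout are illustrative assumptions, not
gemmlowp's internal API); step 4 then consists of running the user-supplied
output pipeline on the returned accumulators:

```
#include <cstdint>
#include <vector>

// Reference computation of the int32 accumulators: widen each uint8 entry
// to int32, add the corresponding offset, and accumulate the products.
std::vector<std::int32_t> ComputeInt32Accumulators(
    const std::vector<std::uint8_t>& lhs, const std::vector<std::uint8_t>& rhs,
    int rows, int depth, int cols, std::int32_t lhs_offset,
    std::int32_t rhs_offset) {
  std::vector<std::int32_t> acc(rows * cols, 0);
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      for (int d = 0; d < depth; d++) {
        // Steps 1 and 2: cast to int32 and add the offsets.
        const std::int32_t lhs_val =
            static_cast<std::int32_t>(lhs[r * depth + d]) + lhs_offset;
        const std::int32_t rhs_val =
            static_cast<std::int32_t>(rhs[d * cols + c]) + rhs_offset;
        // Step 3: accumulate the int32 matrix product.
        acc[r * cols + c] += lhs_val * rhs_val;
      }
    }
  }
  return acc;
}
```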
| |
| ### The legacy low-precision paradigm |
| |
| This older paradigm is the one exposed by the following entry points: |
| |
| * In [public/gemmlowp.h](../public/gemmlowp.h), the `Gemm` entry point. |
*   The deprecated `eight_bit_int_gemm` directory.
| |
Originally, gemmlowp started as an implementation of the (now deprecated)
EightBitIntGemm paradigm, where quantized matrix multiplication takes the
following input parameters:

- the lhs matrix of uint8 quantized values
- the rhs matrix of uint8 quantized values
- the following int32 "quantization parameters", which control how the uint8
  quantized values in the matrices are to be interpreted during the matrix
  computation:
    - lhs_offset
    - rhs_offset
    - result_offset
    - result_mult_int
    - result_shift
| |
| In that legacy paradigm, the mathematical expression to be computed is the |
| result of the following steps: |
| |
| 1. Cast lhs entries from uint8 to int32 and add lhs_offset to each of them. |
| 2. Cast rhs entries from uint8 to int32 and add rhs_offset to each of them. |
| 3. Compute the int32 matrix product of the resulting lhs times rhs. |
| 4. Add result_offset to each entry of the result. |
| 5. Multiply each entry of the result by the following fraction, and round to |
| the nearest integer: |
| |
| ``` |
| result_mult_int |
| --------------- (1) |
| 2^result_shift |
| ``` |
| |
6. Clamp the resulting int32 values to the `[0..255]` range and cast to uint8.
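
As a sketch, the arithmetic of steps 4-6 on a single int32 accumulator looks
as follows (a hypothetical helper assuming `result_shift > 0`, not the actual
`eight_bit_int_gemm` code):

```
#include <algorithm>
#include <cstdint>

// Legacy requantization of one accumulator value. Note that the int32
// multiplication in step 5 can overflow, as discussed below; this sketch
// mirrors that legacy behavior rather than guarding against it.
std::uint8_t LegacyRequantize(std::int32_t acc, std::int32_t result_offset,
                              std::int32_t result_mult_int,
                              std::int32_t result_shift) {
  // Step 4: add result_offset.
  std::int32_t x = acc + result_offset;
  // Step 5: multiply by result_mult_int / 2^result_shift, rounding to
  // nearest by adding half of the divisor before shifting.
  x = (x * result_mult_int + (1 << (result_shift - 1))) >> result_shift;
  // Step 6: clamp to [0..255] and cast to uint8.
  return static_cast<std::uint8_t>(std::min(255, std::max(0, x)));
}
```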
| |
| Again, this paradigm is not recommended for new usage. See |
[quantization.md](quantization.md) for how, reasoning from first principles,
one arrives at a substantially different quantization paradigm.
| |
In addition, note that the integer multiplication by the numerator in step 5
above risks overflowing. That concern is avoided in the currently recommended
| output stages by performing a fixed-point multiplication instead of an ordinary |
| integer multiplication. |
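
For illustration, here is a simplified version of such a fixed-point
multiplication, modeled on `SaturatingRoundingDoublingHighMul` in
[fixedpoint/fixedpoint.h](../fixedpoint/fixedpoint.h): the product is formed
in int64, so the int32-by-int32 multiplication itself cannot overflow, and
only the single corner case `min_int32 * min_int32` needs saturation:

```
#include <cstdint>
#include <limits>

// Returns the high 32 bits of the doubled int64 product of a and b,
// with round-to-nearest and saturation of the one overflowing case.
std::int32_t SaturatingRoundingDoublingHighMul(std::int32_t a, std::int32_t b) {
  const bool overflow = a == b && a == std::numeric_limits<std::int32_t>::min();
  const std::int64_t ab_64 =
      static_cast<std::int64_t>(a) * static_cast<std::int64_t>(b);
  // Rounding nudge: half of the 2^31 divisor, with the sign of the product.
  const std::int32_t nudge = ab_64 >= 0 ? (1 << 30) : (1 - (1 << 30));
  const std::int32_t ab_x2_high32 =
      static_cast<std::int32_t>((ab_64 + nudge) / (1ll << 31));
  return overflow ? std::numeric_limits<std::int32_t>::max() : ab_x2_high32;
}
```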
| |
## Efficient handling of offsets
| |
| At first glance it may seem like the above-described quantized computation |
| scheme requires adding the lhs_offset and rhs_offset to each of the lhs and rhs |
| matrix entries. |
| |
Doing that in the GEMM kernel would incur substantial overhead:

- It would mean extra arithmetic work in the GEMM kernel;
- It would require storing the lhs_offset and rhs_offset in registers, which
  would eat into the register space available for the rest of the GEMM kernel.
| |
| One may then consider adding the lhs_offset and rhs_offset once and for all to |
| lhs and rhs blocks, in a GEMM implementation operating on one lhs block and one |
| rhs block at a time. However, doing so would require storing lhs and rhs blocks |
| in 32 bit (or at least in 16 bit in real-world cases), which would partially |
| negate the memory bandwidth benefits of low-precision computation. |
| |
| Fortunately, there is another way to handle these offsets that has none of the |
| costs of the approaches described above. The idea is as follows. |
| |
| Let `P` denote the matrix shaped like `lhs`, but filled with 1's. |
| |
| Let `Q` denote the matrix shaped like `rhs`, but filled with 1's. |
| |
Adding lhs_offset to each entry of `lhs` means adding `lhs_offset * P` to
`lhs`.

Adding rhs_offset to each entry of `rhs` means adding `rhs_offset * Q` to
`rhs`.
| |
| Thus, as far as handling `lhs_offset` and `rhs_offset` goes, the matrix product |
| to be computed is: |
| |
| ``` |
| (lhs + lhs_offset * P) * (rhs + rhs_offset * Q) |
| ``` |
| |
| Expanding this (using distributivity of matrix multiplication over addition), we |
| see that the above product is equal to the following sum of 4 terms: |
| |
| ``` |
| lhs * rhs (2) |
| + lhs_offset * P * rhs |
| + lhs * rhs_offset * Q |
| + lhs_offset * rhs_offset * P * Q |
| ``` |
| |
| The first term, `lhs * rhs`, is just the matrix multiplication ignoring the |
offsets, i.e. as if `lhs_offset==rhs_offset==0`. Our claim here is that this is
all we have to compute in the GEMM kernel.
| |
| In the second term, `lhs_offset * P * rhs`, notice that since P is filled with |
| 1's, `P * rhs` has all its rows equal to each other, and equal to the row-vector |
| of sums of all the entries in each column of rhs. |
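
For instance, with a 2x2 `rhs`:

```
P * rhs = | 1 1 | * | 1 2 | = | 4 6 |
          | 1 1 |   | 3 4 |   | 4 6 |
```

Both rows of `P * rhs` equal `(4, 6)`, the vector of column sums of `rhs`.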
| |
| Thus, we can compute the second term, `lhs_offset * P * rhs`, by summing each |
| column of rhs. This produces a single row-vector, and in order to add the second |
| term, we simply need to add this row-vector (multiplied by lhs_offset) to each |
| row of the result. This is just a rank one update of the result (equivalently, |
| the second term is a rank one matrix), and we can efficiently store it as a |
| single vector. |
| |
| The third term, `lhs * rhs_offset * Q`, is entirely similar to the second one, |
| and can be similarly computed by summing each row of lhs, storing this in a |
| single column-vector, and later multiplying these sums by rhs_offset. |
| |
The fourth term is a single constant, repeated across all the entries of the
matrix. The matrix `P * Q` is filled with the single constant value `depth` (the
depth of the matrix product, i.e. the number of columns of the lhs). Thus the
fourth term is simply the rank zero update adding this constant to each matrix
entry:
| |
| ``` |
| lhs_offset * rhs_offset * depth |
| ``` |
| |
## Implementation of this technique in gemmlowp
| |
| In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs |
| to prepare them for efficient repeated traversal by the kernel), we compute the |
| sum of each row of the lhs block and the sum of each column of the rhs block. |
| |
See the following member of the `PackedSideBlock` class in
[internal/pack.h](../internal/pack.h):
| |
| ``` |
| // Handle on the additional buffer backing the vector of sums of slices |
| // associated with this block. Owned. |
| Allocator::Handle sums_of_each_slice_handle_; |
| ``` |
| |
`sums_of_each_slice_handle_` is the handle to the buffer allocated to store the
vector containing the sums of rows of lhs, or the sums of columns of rhs.
| |
After these sums have been computed at the packing stage, they are ignored at
the compute kernel stage, since that stage is only concerned with the first of
the four terms in (2); the corresponding rank one updates are only applied at
the unpacking stage. See the default/reference implementation,
`UnpackResultImpl`, in [internal/unpack.h](../internal/unpack.h).