 # Gemmlowp's public entry points

 gemmlowp's public interface is defined in
 [public/gemmlowp.h](../public/gemmlowp.h).

 ## GemmWithOutputPipeline

 The primary public entry point is `GemmWithOutputPipeline`.

 A usage example is given in
 [doc/quantization_example.cc](quantization_example.cc).

 The high-level overview of how this specifies a low-precision matrix
 multiplication is explained in [low-precision.md](low-precision.md). The
 rationale for a specific quantization paradigm is given in
 [quantization.md](quantization.md). That specific quantization paradigm is
 implemented at two different stages of the computation: as pre-processing on
 the operands and as post-processing on the result:

 *   Pre-processing on the LHS, RHS operands, in the form of adding constant
     `lhs_offset`, `rhs_offset` to them, is explained in
     [low-precision.md](low-precision.md).

 *   Post-processing on the result, in the form of a flexible "output pipeline",
     is explained in [output.md](output.md).

 More details follow in the discussion of the specific function parameters below.

 The prototype is:

 ```
 template <typename InputScalar, typename OutputScalar, typename BitDepthParams,
           MapOrder LhsOrder, MapOrder RhsOrder, MapOrder ResultOrder,
           typename OutputPipelineType, typename GemmContextType>
 void GemmWithOutputPipeline(GemmContextType* context,
                             const MatrixMap<const InputScalar, LhsOrder>& lhs,
                             const MatrixMap<const InputScalar, RhsOrder>& rhs,
                             MatrixMap<OutputScalar, ResultOrder>* result,
                             int lhs_offset, int rhs_offset,
                             const OutputPipelineType& output_pipeline);
 ```

 A typical call looks like (from the [usage example](quantization_example.cc)):

 ```
 gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                  gemmlowp::DefaultL8R8BitDepthParams>(
     &gemm_context, uint8_lhs_matrix, uint8_rhs_matrix,
     &uint8_result_matrix, lhs_offset, rhs_offset, output_pipeline);
 ```

 ### Template parameters

 Typically only the first three template parameters need to be specified; the
 rest are deduced automatically from the function arguments:

 *   `InputScalar`: The scalar type of the LHS and RHS operands. At the moment,
     this must be `std::uint8_t`.
 *   `OutputScalar`: The scalar type of the result. At the moment,
     this must be `std::uint8_t`.
 *   `BitDepthParams`: Defines the bit format of the input and output matrices
     and the required accuracy of the computation. At the moment, the only
     non-deprecated valid value is `gemmlowp::DefaultL8R8BitDepthParams`. See
     [less-than-8-bit.md](less-than-8-bit.md) for other values and the general
     idea of this, and how it may become more useful in the future.

 The other template parameters, which typically do not need to be specified, are:

 *   `LhsOrder`, `RhsOrder`, `ResultOrder`: the storage orders (row-major or
     column-major) of the LHS, RHS and result matrices. See
     [public/map.h](../public/map.h). As explained in the performance note
     below, we recommend `RowMajor`, `ColMajor` and `ColMajor` respectively for
     optimal performance.
 *   `OutputPipelineType`: the actual `std::tuple` type of the output pipeline.
     See below explanation of the `output_pipeline` parameter, and
     [output.md](output.md).
 *   `GemmContextType`: the type of the `context` parameter. At the moment, this
     must be `gemmlowp::GemmContext`.
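Why the first three parameters in particular? `BitDepthParams` appears in no function parameter, so the compiler cannot deduce it, and because it is third in the list, the two parameters before it have to be spelled out as well. The sketch below reproduces that situation with hypothetical stand-in types (`Order`, `View`, `DefaultParams`, `Context` and `UsesPreferredOrders` are illustrative names, not gemmlowp's); it only demonstrates the deduction rule, not any gemmlowp behavior.

```cpp
#include <cstdint>

// Hypothetical stand-ins for illustration only; these are NOT gemmlowp types.
enum class Order { kRowMajor, kColMajor };

template <typename Scalar, Order kOrder>
struct View {
  Scalar* data;  // non-owning pointer, in the spirit of a matrix map
};

struct DefaultParams {};  // plays the role of a BitDepthParams-like tag
struct Context {};

// Same parameter layout as the prototype above: Params cannot be deduced
// because no function parameter mentions it, so callers must write the first
// three template arguments. The orders and the context type are deduced.
template <typename In, typename Out, typename Params, Order kLhs, Order kRhs,
          Order kResult, typename Ctx>
bool UsesPreferredOrders(Ctx* /*context*/,
                         const View<const In, kLhs>& /*lhs*/,
                         const View<const In, kRhs>& /*rhs*/,
                         View<Out, kResult>* /*result*/) {
  // Report whether the deduced orders are RowMajor, ColMajor, ColMajor.
  return kLhs == Order::kRowMajor && kRhs == Order::kColMajor &&
         kResult == Order::kColMajor;
}
```

A call then names only `In`, `Out` and `Params`, for example `UsesPreferredOrders<std::uint8_t, std::uint8_t, DefaultParams>(&ctx, lhs, rhs, &result)`.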

 ### Function parameters

 The function parameters taken by `GemmWithOutputPipeline` are:

 *   `context`: The `gemmlowp::GemmContext` object holding state and resources to
     be used for this gemmlowp call.
 *   `lhs`, `rhs`: The LHS and RHS operand matrices. Note that these are
     `MatrixMap` objects, mapping external buffers as matrices, not owning data.
     See [public/map.h](../public/map.h).
 *   `result`: pointer to the destination `MatrixMap` object, which must be
     already constructed, wrapping the external destination buffer with the
     wanted destination matrix shape and storage layout. No memory allocation
     will be performed by gemmlowp for the destination buffer. See
     [public/map.h](../public/map.h).
 *   `lhs_offset`, `rhs_offset` are constants added to each matrix entry in the
     LHS, RHS matrices respectively, as explained in
     [low-precision.md](low-precision.md). This is the only part of the
     quantization paradigm explained in [quantization.md](quantization.md) that
     needs to be implemented as operations on the operands; everything else is
     implemented as operations on the result (see `output_pipeline`).
 *   `output_pipeline` is a `std::tuple` of output stages (see
     [public/output_stages.h](../public/output_stages.h)), specifying the output
     pipeline (see [output.md](output.md)). This is the part of the quantization
     paradigm explained in [quantization.md](quantization.md) that needs to be
     implemented as operations on the result matrix.
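To make the role of `lhs_offset` and `rhs_offset` concrete, here is a minimal reference sketch of the raw accumulators that the output pipeline would receive, assuming 32-bit integer accumulation, a row-major LHS, and a column-major RHS and result. `ReferenceAccumulators` is a hypothetical helper written for this illustration; it is not part of gemmlowp and says nothing about how the library computes the product internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference helper, not a gemmlowp function.
// lhs:    rows x depth, row-major.
// rhs:    depth x cols, column-major.
// return: rows x cols, column-major, holding
//         sum over d of (lhs(r, d) + lhs_offset) * (rhs(d, c) + rhs_offset).
std::vector<std::int32_t> ReferenceAccumulators(
    const std::vector<std::uint8_t>& lhs, const std::vector<std::uint8_t>& rhs,
    int rows, int depth, int cols, int lhs_offset, int rhs_offset) {
  assert(lhs.size() == static_cast<std::size_t>(rows) * depth);
  assert(rhs.size() == static_cast<std::size_t>(depth) * cols);
  std::vector<std::int32_t> acc(static_cast<std::size_t>(rows) * cols, 0);
  for (int c = 0; c < cols; ++c) {
    for (int r = 0; r < rows; ++r) {
      std::int32_t sum = 0;
      for (int d = 0; d < depth; ++d) {
        // The offsets are added to the operands before multiplying.
        const std::int32_t l = lhs[r * depth + d] + lhs_offset;
        const std::int32_t rv = rhs[c * depth + d] + rhs_offset;
        sum += l * rv;
      }
      acc[c * rows + r] = sum;  // column-major store
    }
  }
  return acc;
}
```

Everything that happens after this point (scaling back down to 8 bits, clamping, and so on) belongs to the output pipeline rather than to the offsets.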

 ### Performance note on storage orders

 gemmlowp supports arbitrary combinations of storage orders for the LHS, RHS and
 result matrices. However, not all combinations are equally optimized.

 Because gemmlowp is primarily aimed at neural network inference workloads,
 optimization focus is on this particular combination of storage orders:

 *   `LhsOrder=RowMajor`
 *   `RhsOrder=ColMajor`
 *   `ResultOrder=ColMajor`

 The rationale is that the LHS is typically the constant weights of a neural
 network layer (e.g. the weights of a Convolutional layer implemented as a matrix
 multiplication), while the RHS and result are neural network activations,
 respectively the input and output activations of the layer.

 Because the RHS and result are activations, we want them to share the same
 storage order -- so that one layer's output activations can be readily used as
 the next layer's input activations. Thus, we focus on `RhsOrder=ResultOrder`.

 We also know from general considerations on matrix multiplication that it is
 slightly more efficient to have the direction of accumulation (the "depth"
 dimension) be the direction of contiguous storage in memory. That means that it
 is always going to be slightly easier and more efficient to have
 `LhsOrder=RowMajor` and `RhsOrder=ColMajor`.

 Putting this together, we arrive at gemmlowp's focus on the above-described
 combination of storage orders.
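The contiguity argument can be checked with a little index arithmetic. The sketch below assumes densely stored matrices (no padding between rows or columns) and uses a hypothetical `Order` enum and helper functions that are not gemmlowp's own; it computes how far apart in memory two consecutive depth steps are for each operand.

```cpp
#include <cstddef>

// Hypothetical enum for illustration; gemmlowp's own type is MapOrder.
enum class Order { kRowMajor, kColMajor };

// Linear offset of entry (row, col) in a densely stored rows x cols matrix.
std::size_t OffsetOf(Order order, std::size_t row, std::size_t col,
                     std::size_t rows, std::size_t cols) {
  return order == Order::kRowMajor ? row * cols + col : col * rows + row;
}

// The LHS is rows x depth: the depth dimension runs along a row.
std::size_t LhsDepthStride(Order order, std::size_t rows, std::size_t depth) {
  return OffsetOf(order, 0, 1, rows, depth) - OffsetOf(order, 0, 0, rows, depth);
}

// The RHS is depth x cols: the depth dimension runs down a column.
std::size_t RhsDepthStride(Order order, std::size_t depth, std::size_t cols) {
  return OffsetOf(order, 1, 0, depth, cols) - OffsetOf(order, 0, 0, depth, cols);
}
```

With a row-major LHS and a column-major RHS both strides are 1, so the accumulation loop reads consecutive bytes from each operand; with the opposite orders the strides become the number of LHS rows and the number of RHS columns respectively.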

 Using other storage orders will typically mean taking less efficient paths in
 the packing and unpacking stages (see [packing.md](packing.md)). The compute
 kernel stage ([kernel.md](kernel.md)) is unaffected.

 ## GemmWithOutputPipelinePC

 This is a variant where `lhs_offset` and `rhs_offset` may be vectors instead of
 scalars. They are then broadcast against the LHS and RHS respectively.

 This is useful for some flavors of neural network inference with "per-channel
 quantization", whence the PC suffix. This has been useful in some settings where
 a neural network trained in float arithmetic was subsequently quantized. On the
 other hand, retraining neural networks for quantized inference tends to remove
 the need for per-channel quantization. For that reason, the long-term usefulness
 of this entry point is in question.
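As a rough illustration of what vector offsets mean, the sketch below assumes that broadcasting gives one LHS offset per LHS row and one RHS offset per RHS column; that reading is an assumption made for this example, so check the offset vector types in [public/gemmlowp.h](../public/gemmlowp.h) before relying on it. `PerChannelAccumulators` is a hypothetical reference helper, not a gemmlowp function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference helper, not a gemmlowp function.
// lhs: rows x depth, row-major. rhs: depth x cols, column-major.
// Assumption: lhs_offsets has one entry per LHS row and rhs_offsets has one
// entry per RHS column. Returns the rows x cols accumulators, column-major.
std::vector<std::int32_t> PerChannelAccumulators(
    const std::vector<std::uint8_t>& lhs, const std::vector<std::uint8_t>& rhs,
    int rows, int depth, int cols, const std::vector<int>& lhs_offsets,
    const std::vector<int>& rhs_offsets) {
  assert(lhs.size() == static_cast<std::size_t>(rows) * depth);
  assert(rhs.size() == static_cast<std::size_t>(depth) * cols);
  assert(lhs_offsets.size() == static_cast<std::size_t>(rows));
  assert(rhs_offsets.size() == static_cast<std::size_t>(cols));
  std::vector<std::int32_t> acc(static_cast<std::size_t>(rows) * cols, 0);
  for (int c = 0; c < cols; ++c) {
    for (int r = 0; r < rows; ++r) {
      std::int32_t sum = 0;
      for (int d = 0; d < depth; ++d) {
        // Each row of the LHS and each column of the RHS has its own offset.
        const std::int32_t l = lhs[r * depth + d] + lhs_offsets[r];
        const std::int32_t rv = rhs[c * depth + d] + rhs_offsets[c];
        sum += l * rv;
      }
      acc[c * rows + r] = sum;  // column-major store
    }
  }
  return acc;
}
```

Filling both offset vectors with a single repeated value reduces this to the scalar-offset case of `GemmWithOutputPipeline`.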

 ## Gemm

 This is gemmlowp's original, now legacy and deprecated, entry point. See the
 section of [low-precision.md](low-precision.md) on the legacy quantization
 paradigm. Avoid in new code.

 ## The eight_bit_int_gemm directory

 As explained in the top-level [README.md](../README.md#public-interfaces), this
 is entirely deprecated.