# Overview of gemmlowp design

## Primer on GEMM, kernels, and cache friendliness

gemmlowp, like most GEMMs, implements the straightforward matrix multiplication
algorithm, which takes n^3 multiply-accumulate instructions for n*n sized
matrices. Because the arithmetic complexity (n^3) grows faster than the memory
complexity (n^2), memory accesses are redundant: each matrix entry is accessed
n times. A large part of a GEMM's performance and design goes toward minimizing
the inefficiency resulting from these redundant memory accesses.
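
For concreteness, here is a minimal sketch of that straightforward algorithm.
The uint8 inputs and int32 accumulation anticipate gemmlowp's data types
(discussed below); the row-major layout and the function name are illustrative
assumptions rather than gemmlowp code:

```
#include <cstdint>

// Naive n x n GEMM: result += lhs * rhs, all matrices row-major.
// Each entry of lhs and rhs is re-read n times across the loops,
// which is the redundancy discussed above.
void NaiveGemm(const std::uint8_t* lhs, const std::uint8_t* rhs,
               std::int32_t* result, int n) {
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      std::int32_t acc = result[i * n + j];
      for (int k = 0; k < n; k++) {
        // one multiply-accumulate
        acc += static_cast<std::int32_t>(lhs[i * n + k]) *
               static_cast<std::int32_t>(rhs[k * n + j]);
      }
      result[i * n + j] = acc;
    }
  }
}
```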

Ultimately, once values are loaded into CPU registers, they cost nothing to
access, so as long as we can work within registers, this problem doesn't exist.
Thus, in order to be efficient, a GEMM's inner loops must wisely use the
available registers to do as much arithmetic work as possible before loading
more data from memory into registers. This means that a GEMM implementation
needs architecture-specific inner loops, tailored to details such as the number
of available registers and typically written in assembly. This
architecture-specific inner-loop component is referred to as the GEMM kernel.
(More details about kernels are in [kernel.md](kernel.md)).
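
To make the register-blocking idea concrete, here is a hedged, portable C++
sketch of a tiny kernel that keeps a 4x4 accumulator tile in local variables
(standing in for registers), so that each loaded value is reused four times per
depth step. Real gemmlowp kernels are architecture-specific assembly or
intrinsics; the function name, tile shape and packed layout below are
illustrative assumptions:

```
#include <cstdint>

// Illustrative micro-kernel: multiply a 4-wide lhs slice by a 4-wide rhs
// slice over a shared depth K, keeping the 4x4 int32 accumulator tile in
// locals so each loaded value is reused 4 times before touching memory again.
// Assumes the packed layout stores, for each depth step k, 4 contiguous lhs
// values and 4 contiguous rhs values (this is what packing arranges).
void MicroKernel4x4(const std::uint8_t* packed_lhs,
                    const std::uint8_t* packed_rhs,
                    std::int32_t acc[4][4], int K) {
  for (int k = 0; k < K; k++) {
    const std::uint8_t* a = packed_lhs + 4 * k;  // 4 lhs values at depth k
    const std::uint8_t* b = packed_rhs + 4 * k;  // 4 rhs values at depth k
    for (int i = 0; i < 4; i++) {
      for (int j = 0; j < 4; j++) {
        acc[i][j] += static_cast<std::int32_t>(a[i]) *
                     static_cast<std::int32_t>(b[j]);
      }
    }
  }
}
```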

However, only small blocks can fit at a given time in registers, so at larger
scales one needs to repeatedly load blocks of matrices from memory, and these
accesses are redundant for the reason outlined above. The way that one minimizes
the resulting inefficiency is by organizing for cache locality, so that most of
these accesses hit the L1 cache, and most of the remaining ones hit the L2
cache, etc.

This is achieved by subdividing the matrices into blocks sized to fit in L2
cache, and subdividing these blocks into sub-blocks sized to fit in L1 cache,
and performing the matrix multiplication one such block at a time.
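
As a rough illustration of the sizing arithmetic involved (the cache size,
occupancy fraction and helper name below are assumptions for the sake of the
example, not gemmlowp's actual block-sizing heuristics):

```
// Hypothetical block sizing: how many rows of a packed uint8 lhs block of a
// given depth fit in a fraction of an assumed 512 KB L2 cache, leaving room
// for the rhs block and the accumulators. Illustrative only.
constexpr int kAssumedL2CacheBytes = 512 * 1024;

int L2BlockRows(int depth) {
  const int lhs_budget_bytes = (kAssumedL2CacheBytes * 3) / 4;
  return lhs_budget_bytes / depth;  // 1 byte per uint8 entry
}
```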

In practice, it tends to pay off to "pack" input blocks for optimally efficient
traversal by the kernel, since they will be traversed multiple times. Here,
"packing" means at least reordering the data layout so as to allow 1) simple
access patterns that fit the CPU's cache behavior (in particular, the cache line
size), and 2) simple loading into SIMD vector registers by the kernel.
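
As a sketch of what packing amounts to, here is an illustrative routine that
rearranges a row-major lhs block so that, for each depth index, the 4 values
needed together by a 4-wide kernel sit contiguously. The layout and names are
assumptions for this example; gemmlowp's actual packed formats are described in
[packing.md](packing.md):

```
#include <cstdint>
#include <vector>

// Illustrative packing of a (rows x depth) row-major lhs block. For each
// group of 4 rows and each depth index k, the 4 values are written
// contiguously, so the kernel can load them with one contiguous access.
// Assumes rows is a multiple of 4 for brevity.
void PackLhsBlock(const std::uint8_t* lhs, int lhs_stride, int rows, int depth,
                  std::vector<std::uint8_t>* packed) {
  packed->resize(rows * depth);
  std::uint8_t* dst = packed->data();
  for (int r = 0; r < rows; r += 4) {
    for (int k = 0; k < depth; k++) {
      for (int i = 0; i < 4; i++) {
        *dst++ = lhs[(r + i) * lhs_stride + k];  // strided reads, sequential writes
      }
    }
  }
}
```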

So a typical GEMM, in pseudo-code, tends to look like this:

```
allocate(some_lhs_L2_block);
allocate(some_rhs_L2_block);
for (some_lhs_L2_block) {
  // Pack an L2-sized block of the lhs; it is reused across all rhs blocks.
  pack(some_lhs_L2_block);
  for (some_rhs_L2_block) {
    // Pack an L2-sized block of the rhs.
    pack(some_rhs_L2_block);
    // Run the kernel over L1-sized sub-blocks of the packed blocks.
    for (some_lhs_sub_block in some_lhs_L2_block) {
      for (some_rhs_sub_block in some_rhs_L2_block) {
        kernel(some_lhs_sub_block, some_rhs_sub_block);
      }
    }
  }
}
```

## Impact of low-precision computation on gemmlowp design

Refer to [low-precision.md](low-precision.md) for specifics of the
low-precision-computation paradigm and how it's implemented in gemmlowp.

Inputs and outputs are matrices of uint8 values, but internally we are
accumulating int32 values, only converting them back to uint8 at the end. This
means that we need to store a block of int32 accumulators at a time. We compute
a block of the result in int32 accumulators and then we "unpack" it into the
destination matrix all at once. In this way, we minimize the amount of memory
used to store int32 values at a given time.

Because of that, besides the "pack" and "kernel" stages outlined above, a third
stage is needed in gemmlowp, which we call "unpack". Thus we arrive at the
3-stage computation scheme that gemmlowp uses:

1. Pack lhs/rhs blocks from the input matrices.
2. Compute the product of the packed blocks, using the kernel.
3. Unpack the result block into the output matrix.

The pseudo-code overview of gemmlowp now looks like:

```
allocate(some_lhs_L2_block);
allocate(some_rhs_L2_block);
// new: temp storage for int32 accums
allocate(some_int32_accumulators_block);
for (some_lhs_L2_block) {
  pack(some_lhs_L2_block);
  for (some_rhs_L2_block) {
    pack(some_rhs_L2_block);
    for (some_lhs_sub_block in some_lhs_L2_block) {
      for (some_rhs_sub_block in some_rhs_L2_block) {
        // new: pass int32 accums to kernel
        kernel(&some_int32_accumulators_block,
               some_lhs_sub_block,
               some_rhs_sub_block);
      }
    }
    // new: unpack int32 accums into destination matrix
    unpack(some_int32_accumulators_block);
  }
}
```
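
To give an idea of what the unpack stage does to each entry, here is a hedged
sketch of converting one int32 accumulator back to uint8. The parameter names
mirror those passed to UnpackResult in the code excerpt in the next section,
but the exact arithmetic (including the offset contributions of the packed
lhs/rhs blocks) is defined in [low-precision.md](low-precision.md), so treat
this as illustrative:

```
#include <algorithm>
#include <cstdint>

// Illustrative requantization of a single int32 accumulator to a uint8
// destination value. The rounding and clamping details here are assumptions,
// not gemmlowp's exact code.
std::uint8_t RequantizeToUint8(std::int32_t acc, std::int32_t result_offset,
                               std::int32_t result_mult_int,
                               std::int32_t result_shift) {
  const std::int32_t scaled = (acc + result_offset) * result_mult_int;
  const std::int32_t rounding =
      (result_shift > 0) ? (1 << (result_shift - 1)) : 0;
  const std::int32_t shifted = (scaled + rounding) >> result_shift;
  return static_cast<std::uint8_t>(std::min(255, std::max(0, shifted)));
}
```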

## Exploring gemmlowp code

The design outlined above can be readily matched to gemmlowp source code, in
particular in this file, which gives a simple GEMM implementation fitting in one
rather small function:

```
internal/single_thread_gemm.h
```

The reader can compare the above pseudo-code to the actual code in this file:

```
for (int r = 0; r < rows; r += block_params.l2_rows) {
  int rs = std::min(block_params.l2_rows, rows - r);

  PackLhs(&packed_lhs, lhs.block(r, 0, rs, depth));

  for (int c = 0; c < cols; c += block_params.l2_cols) {
    int cs = std::min(block_params.l2_cols, cols - c);

    if (!pack_rhs_once) {
      PackRhs(&packed_rhs, rhs.block(0, c, depth, cs));
    }

    Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs);

    auto result_block = result->block(r, c, rs, cs);
    UnpackResult(&result_block, packed_result, packed_lhs, packed_rhs, depth,
                 result_offset, result_mult_int, result_shift);
  }
}
```

The files in `internal/` fall into a few categories:

There are two top-level GEMM implementations:

* [internal/single_thread_gemm.h](../internal/single_thread_gemm.h)
* [internal/multi_thread_gemm.h](../internal/multi_thread_gemm.h)

They both call into pack/compute/unpack stages (see [kernel.md](kernel.md) and
[packing.md](packing.md)) implemented in the following files:

* [internal/pack.h](../internal/pack.h)
* [internal/compute.h](../internal/compute.h)
* [internal/unpack.h](../internal/unpack.h)
    * This in turn calls into [internal/output.h](../internal/output.h) for
      the output pipeline (see [output.md](output.md))

The pack.h and unpack.h files contain generic templated code that can be
overridden by optimized code in template specializations; for example, see the
NEON optimized code here:

* [internal/pack_neon.h](../internal/pack_neon.h)
* [internal/unpack_neon.h](../internal/unpack_neon.h)
    * This in turn calls into
      [internal/output_neon.h](../internal/output_neon.h)

The compute stage contains generic code in compute.h that only calls into
optimized code through the Kernel::Run() entry point. Each kernel is basically
just a struct offering a Run() implementation; see the NEON kernels in:

* [internal/kernel_neon.h](../internal/kernel_neon.h)
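
For orientation, here is a hypothetical sketch of the shape of such a kernel
struct. The struct name, the Run() signature and the 4x4 cell size are
illustrative assumptions, not gemmlowp's actual kernel interface:

```
#include <cstddef>
#include <cstdint>

// Hypothetical kernel struct, for illustration only. Run() consumes packed
// lhs/rhs data and updates a small block of int32 accumulators; real gemmlowp
// kernels implement this in architecture-specific assembly or intrinsics and
// also describe the packed data layout they expect.
struct IllustrativeKernel {
  void Run(std::int32_t* acc, std::size_t acc_row_stride,
           const std::uint8_t* packed_lhs, const std::uint8_t* packed_rhs,
           std::size_t depth) const {
    // Portable fallback: a 4x4 accumulator cell, matching the packed layout
    // assumed in the earlier micro-kernel sketch.
    for (std::size_t k = 0; k < depth; k++) {
      for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
          acc[i * acc_row_stride + j] +=
              static_cast<std::int32_t>(packed_lhs[4 * k + i]) *
              static_cast<std::int32_t>(packed_rhs[4 * k + j]);
        }
      }
    }
  }
};
```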