# The packing stage in gemmlowp

## Introduction

We assume familiarity with [design.md](design.md) and with the overall 3 stages
of computations described there: packing, kernel, unpacking.

This page goes into more detail about the first stage: packing.

We also assume familiarity with [kernel.md](kernel.md), as it describes the
packed format that the kernels expect, which is basically the contract that the
packing stage must honor.

Some parts below also assume familiarity with
[low-precision.md](low-precision.md), as the packing stage also has to compute
the vectors of sums of rows or columns, as described there.

## The storage order of packed blocks, partly hidden behind sequential access

As explained in [design.md](design.md), the primary purpose of packing is to
ensure that when the kernel traverses Lhs/Rhs matrix data, it can do so
efficiently thanks to having the data stored in an order that is as similar as
possible to the order in which the compute stage has to traverse this data.

This traversal order is nontrivial for the reasons outlined in
[design.md](design.md): at the innermost level, one tries to work within
registers as much as possible; at the next level, one tries to stay within L1
cache as much as possible. The packed blocks that we handle are supposed to fit
entirely in L2 cache.

Thus it has become standard in GEMM design to describe complicated "Z-order" or
"fractal order" storage for packed blocks.

However, we should keep in mind that the whole point of the packed storage order
is to be as similar as possible to the order of traversal during the compute
stage. The storage order doesn't matter in itself; the only thing that matters
is simple access patterns during the compute stage.

This suggests the following approach to implementing packing: take the exact
same hierarchy of nested loops as the compute stage, drop the loops that are not
relevant to the side (Lhs or Rhs) being packed, and try to use mostly sequential
access to the destination packed data.

This hierarchy of nested loops can be seen in PackSideBlockImpl (PackL2, PackL1,
PackRun); compare it to the similar hierarchy of loops in internal/compute.h.
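
Here is a standalone sketch of that loop hierarchy (this is not the actual
gemmlowp code; the member names loosely mimic gemmlowp's BlockParams, and the
body of PackRun is a placeholder):

```
// Illustrative sketch of the PackL2 / PackL1 / PackRun loop nesting.
#include <algorithm>
#include <cstdint>
#include <vector>

struct ToySidePacker {
  int width = 0, depth = 0;          // dimensions of the whole L2-sized block
  int l1_width = 0, l1_depth = 0;    // dimensions of L1-sized sub-blocks
  int kernel_width = 0;              // width of one kernel-format run
  std::vector<std::uint8_t> packed;  // destination, written sequentially

  void PackL2() {
    for (int d = 0; d < depth; d += l1_depth) {
      for (int w = 0; w < width; w += l1_width) {
        PackL1(w, std::min(l1_width, width - w), d,
               std::min(l1_depth, depth - d));
      }
    }
  }
  void PackL1(int start_width, int w_size, int start_depth, int d_size) {
    for (int w = 0; w < w_size; w += kernel_width) {
      // In gemmlowp, this is where seek_run() positions the sequential cursor.
      PackRun(start_width + w, start_depth, d_size);
    }
  }
  void PackRun(int start_width, int start_depth, int d_size) {
    // Innermost loop, along the depth dimension: append the entries of this
    // run to the destination, strictly sequentially.
    for (int d = 0; d < d_size; d++) {
      for (int w = 0; w < kernel_width; w++) {
        packed.push_back(0);  // placeholder for source(start_width + w,
                              //                        start_depth + d)
      }
    }
  }
};
```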

In this way, the more intricate small-scale details of the packed data format
never need to be made explicit (which would be complicated). We still use some
"seeking", but only at larger scales, where the storage order is less
complicated to describe.

### Sequential access to PackedSideBlock data

See PackedSideBlock in internal/pack.h, specifically the following data members:

```
// Handle on the buffer backing this packed block. Owned.
Allocator::Handle data_handle_;
```

and:

```
// pos_ is the current position in the buffer, which we access
// sequentially, like a file.
// The idea is that we pack data in the same order as it is
// going to be traversed during the computation, which for
// cache-friendliness reasons is complicated to random-access,
// as the offsets calculations would be intricate. So we
// give up random-access addressing, and instead content ourselves
// with sequential access.
//
// pos_ is mutable because during the computation we will want to
// be able to iterate on the data in a const PackedSideBlock.
mutable int pos_;
```

The methods exposing sequential access are:

```
std::uint8_t* current_data() {
  return allocator_->GetPointer<std::uint8_t>(data_handle_) + pos_;
}
```

and:

```
void seek_next_cell() const { pos_ += KernelSideFormat::Cell::kSize; }

void seek_forward_n_cells(int n) const {
  pos_ += n * KernelSideFormat::Cell::kSize;
}
```
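
As an illustration of how this interface is meant to be used (this is a
simplified fragment, not the actual packing code, and cell_bytes is a
hypothetical buffer holding one cell of already-reordered source data), a
packer writes one cell's worth of bytes at the current position and then
advances the cursor:

```
// Simplified usage sketch of the sequential-access interface.
std::uint8_t* dst = packed_side_block->current_data();
for (int i = 0; i < KernelSideFormat::Cell::kSize; i++) {
  dst[i] = cell_bytes[i];
}
packed_side_block->seek_next_cell();  // advance to where the next cell goes
```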

### Random access to PackedSideBlock data at larger scales

We still need some random access at larger scales (though at a coarse
granularity), which is unavoidable since GEMM is O(n^3) and has to traverse each
of the O(n^2) inputs O(n) times.

The watershed between sequential access and random access is at the level of a
'Run'. Throughout gemmlowp we consistently use the term 'Run' to refer to the
innermost GEMM loop in the depth dimension. That's the critical inner loop that
must be as fast as possible, and thus the one for which we absolutely want
sequential access to packed data, so that the storage order is optimal by
construction. At larger scales, i.e. between runs, we accept that the storage
order is less optimal; since it's also less intricate, it's not too hard to
implement random access there.

This is done by the seek_run method:

```
void seek_run(int start_width, int start_depth) const {
  int kernel_run_depth =
      std::min<int>(params_.l1_depth, params_.l2_depth - start_depth);
  pos_ = params_.l2_width * start_depth + start_width * kernel_run_depth;
}
```

We see that the formula involves the l1_depth parameter, which is how the packed
storage order depends on L1 cache size. Again, the whole packed block is
supposed to fit in L2 cache.
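
For concreteness, here is a worked example of that formula with made-up block
parameters (these numbers are purely illustrative, not gemmlowp defaults):

```
// Suppose l2_width = 96, l2_depth = 256, l1_depth = 64, one byte per entry.
// Seeking to the run at start_width = 12, start_depth = 64 gives:
//   kernel_run_depth = min(64, 256 - 64) = 64
//   pos_             = 96 * 64 + 12 * 64 = 6144 + 768 = 6912
// In other words, all the runs of the first depth-64 slice (96 * 64 bytes of
// packed data) are stored first; within the next slice, runs are laid out
// contiguously in order of increasing start_width.
```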

## The innermost loop of the packing stage, PackRun, and PackingRegisterBlock

Keeping with our consistent usage of the term 'Run' throughout gemmlowp, the
innermost loop is called PackRun().

Here we recall a very important principle that was explained in
[kernel.md](kernel.md): the kernel is free to dictate the precise data format
that it expects; the packing code has to honor it. So there's an asymmetry here:
the kernel is the master, the packing is the slave. That's why the packing code
is templatized in the KernelSideFormat. At larger scales, the packing is
independent of kernel format details, but inside PackRun is where we take care
of the small-scale details that do depend on them. That's why it's a good thing
that we only need sequential access here, as it would be very complicated to
spell out random access at this scale.

Anyway, PackRun.

Since it is the critical inner loop, it is what we want to allow specializing
for particular CPU architectures. To allow that, we handle blocks of fixed
dimensions one at a time, which is intended to be friendly enough to
optimization. These blocks are PackingRegisterBlock's and their dimensions are:

```
width = KernelWidth
depth = kRegisterSize
```

See [kernel.md](kernel.md) and internal/kernel.h for the former, and
internal/common.h for the latter.

See the comments around PackingRegisterBlock in internal/pack.h:

```
// A PackingRegisterBlock is a small fixed-size block of a matrix being
// packed. This class is the generic non-optimized implementation,
// it is inherited by the generic implementation of PackingRegisterBlock,
// which may be overridden by template specialization. Overriding it is how
// one may provide optimized packing code paths.
//
// The packing of a block proceeds in two steps:
//   1. Ensuring that we have a complete block of source data, i.e. a block of
//      the compile-time prescribed size. This is where we handle unaligned
//      boundaries: if we don't have a complete block of source data, then
//      we copy and zero-extend it into a local temporary (complete_src_),
//      see MakeCompleteSrc. In the generic case, we do have a complete block,
//      so we just use it in-place, see UseCompleteSrcInPlace.
//   2. Packing a complete block into the destination, see Pack. This is the
//      most critical part, so it's convenient that unaligned boundaries have
//      already been handled in step 1.
```
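
Schematically, here is how these two steps fit into the PackRun loop (this is
only an outline, not the actual code in internal/pack.h):

```
// Outline of PackRun driving a PackingRegisterBlock:
//
//   for each register-sized chunk of the run, along the depth dimension:
//     step 1: if the source chunk is complete, UseCompleteSrcInPlace(chunk);
//             otherwise MakeCompleteSrc(chunk), which copies and zero-extends
//             it into the local complete_src_ buffer;
//     step 2: Pack, which writes the complete chunk into the destination
//             PackedSideBlock sequentially, cell after cell.
```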

## Other things that the packing stage has to do

Besides storing matrix entries in a suitable order, the packing stage also has
two other things to do.

First, packing has to compute the vectors of sums of entries along the depth
dimension. If this sounds mysterious, read [low-precision.md](low-precision.md).
These sums will only be used at the unpacking stage.
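
To make this concrete, here is a rough standalone sketch of that side
computation (in gemmlowp itself the accumulation is fused into the packing
loops and the results are stored in the PackedSideBlock, but the idea is the
same):

```
// Standalone sketch: one sum per packed row/column, accumulated across the
// depth dimension of a width x depth block stored row-major.
#include <cstdint>
#include <vector>

std::vector<std::int32_t> ComputeSumsAlongDepth(
    const std::vector<std::uint8_t>& entries, int width, int depth) {
  std::vector<std::int32_t> sums(width, 0);
  for (int w = 0; w < width; w++) {
    for (int d = 0; d < depth; d++) {
      sums[w] += entries[w * depth + d];
    }
  }
  return sums;
}
```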

Second, if the BitDepthSetting requires less than 8 bits of precision, then at
the packing stage we have to requantize inputs accordingly. See
[less-than-8-bit.md](less-than-8-bit.md) for details. This is done by the
Requantize() function.
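
As a rough illustration of the idea only (the precise scheme, including how
rounding is handled, is what [less-than-8-bit.md](less-than-8-bit.md) and the
actual Requantize() define), requantizing an 8-bit value to kBits bits amounts
to rescaling the range [0, 255] onto [0, 2^kBits - 1]:

```
// Sketch of requantization to a lower bit depth; not gemmlowp's exact code.
#include <cstdint>

std::uint8_t RequantizeSketch(std::uint8_t x, int kBits) {
  const int max_val = (1 << kBits) - 1;
  return static_cast<std::uint8_t>((x * max_val + 127) / 255);  // rescale with rounding
}
```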

## Specialized packing paths for specific formats on specific CPU architectures

Please refer to internal/pack_neon.h for examples of doing that. The piece of
code to be specialized is PackingRegisterBlock. However, inside of it, only the
Pack() method typically needs to be specialized (the rest is unlikely to be
performance-critical). So one typically specializes PackingRegisterBlock while
still inheriting PackingRegisterBlockBase to keep the generic parts, and
overrides only the Pack() method.

Template specialization for the right template parameters is how one specifies
in which case a given path is to be used in place of the generic packing code.
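
A rough skeleton of such a specialization is sketched below. The type names
SrcMapType and KernelSideFormatType, as well as the exact Pack() signature, are
simplified placeholders here; refer to internal/pack.h and internal/pack_neon.h
for the real declarations.

```
// Simplified skeleton of a specialized packing path.
template <>
class PackingRegisterBlock<SrcMapType, PackedSideBlock<KernelSideFormatType>>
    : public PackingRegisterBlockBase<SrcMapType,
                                      PackedSideBlock<KernelSideFormatType>> {
 public:
  void Pack(PackedSideBlock<KernelSideFormatType>* dst, int start_width) {
    // Architecture-specific (e.g. NEON) code goes here. Only this method is
    // overridden; everything else is inherited from PackingRegisterBlockBase.
  }
};
```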

It is entirely possible to set the value of kRegisterSize differently based on
the CPU architecture (for example, 32 on x86 with AVX) as long as all the
specialized packing paths used on that CPU architecture are consistent with it.
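
For instance, such a per-architecture setting could look roughly like the
following hypothetical sketch (check internal/common.h for how kRegisterSize is
actually defined):

```
// Hypothetical per-architecture definition; not the actual internal/common.h.
#if defined(__AVX2__)
const int kRegisterSize = 32;  // 256-bit registers, 8-bit entries
#else
const int kRegisterSize = 16;  // 128-bit registers (e.g. NEON, SSE)
#endif
```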