# Computation with less than 8 bits in gemmlowp

## Introduction

We assume familiarity with gemmlowp's low-precision uint8 computation paradigm,
which is described in [low-precision.md](low-precision.md).

This document is about the possibility of further reducing precision below 8
bits.

This allows for higher arithmetic throughput on some architectures, at the cost
of decreased accuracy.

## The past, present, and future of less-than-8-bit computation in gemmlowp

A meta note is needed here as to how this fits with the general gemmlowp design.

### The past

Less-than-8-bit computation was initially designed and implemented in gemmlowp
as a drop-in replacement for regular 8bit computation, a plain optimization. The
idea was to automatically requantize 8bit operands to less-than-8bit during the
O(N^2) packing stage, then take advantage of the lower bit depth during the
O(N^3) compute stage. For large enough matrices, that should be worth it.

### The present

TODO(benoitjacob): update this documentation. This 'present' state just
became the past (February 2017).

At the moment, this less-than-8-bit mode of gemmlowp is not much used in
practice, because the implicit requantization of operands from 8bit to
less-than-8bit turned out to be more expensive than initially expected, both in
terms of speed and accuracy:

1.  Speed: the O(N^2) requantization is only negligible compared to the O(N^3)
    compute kernel when the matrix size N is large enough; in practice, smaller
    matrix sizes turned out to be very important, making the requantization
    approach slower than expected.

2.  Accuracy: As neural networks were optimized for size, their sensitivity to
    numerical accuracy increased. Then the approach of requantizing
    already-quantized data turned out to be more wasteful of accuracy than we
    could afford.

### The future

Less-than-8bit still probably has good prospects; what should be dropped is the
requantization. In other words, in the future, we might have neural networks
trained right away for some bit depth lower than 8 bits. The resulting values
would probably still be stored as 8 bits (unless the bit depth eventually
becomes very low). Thus, no particular work would be needed in the packing
stage; no overhead or loss of accuracy would be incurred anymore.

In other words: the design of less-than-8-bit kernels is probably useful in the
long run; what is on the way out is requantization and packing/unpacking-stage
aspects.

With that said, the rest of this page retains its old content about the present
approach:

## Public interface

### The BitDepthSetting parameter in the EightBitIntGemm interface

Accessing less-than-8-bit computation via the EightBitIntGemm interface is very
simple: EightBitIntGemm takes a BitDepthSetting enum which selects among a fixed
set of supported bit-depth combinations.

### The BitDepthParams parameter in the public/gemmlowp.h interface

The public/gemmlowp.h interface exposes more extensive control over
quantization, by means of a BitDepthParams template parameter, which is a type
parameter carrying information about:

1.  The LHS and RHS bit depths, which can be set arbitrarily and independently;
2.  The 'RoundingStrategy', which is the heuristic used to choose a rounding
    mode, based on the accumulation size (a.k.a. the "depth" dimension of the
    Gemm).

Details can be seen in public/bit_depth.h.
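
For illustration, a call through this interface might look roughly like the
sketch below. This is a hedged example: the types and the exact Gemm signature
shown (GemmContext, MatrixMap, DefaultL7R5BitDepthParams) are assumed from the
public/gemmlowp.h and public/bit_depth.h of that era, and the quantization
parameters are placeholders.

    #include "public/gemmlowp.h"

    // Sketch: result = lhs * rhs with 7-bit Lhs / 5-bit Rhs internal depths.
    // Matrix shapes: lhs is rows x depth, rhs is depth x cols.
    void GemmL7R5Example(gemmlowp::GemmContext* context,
                         const std::uint8_t* lhs_data,
                         const std::uint8_t* rhs_data,
                         std::uint8_t* result_data,
                         int rows, int depth, int cols) {
      gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>
          lhs(lhs_data, rows, depth);
      gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>
          rhs(rhs_data, depth, cols);
      gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>
          result(result_data, rows, cols);
      // Placeholder quantization parameters.
      const int lhs_offset = -128, rhs_offset = -128, result_offset = 128;
      const int result_mult_int = 1, result_shift = 8;
      // The second template argument selects the internal bit depths.
      gemmlowp::Gemm<std::uint8_t, gemmlowp::DefaultL7R5BitDepthParams>(
          context, lhs, rhs, &result, lhs_offset, rhs_offset, result_offset,
          result_mult_int, result_shift);
    }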

### How does BitDepth{Setting,Params} affect input/output uint8 matrix data?

Input/output matrix data is all uint8's, ranging from 0 to 255, regardless of
the BitDepth{Setting,Params}.

So the BitDepth{Setting,Params} is only an internal detail: it merely allows
gemmlowp to use lower precision internally, but the input/output data format is
unaffected.

As far as the API contract goes, the only thing that the
BitDepth{Setting,Params} does is to relax the accuracy requirement. With
standard 8bit/8bit computation, gemmlowp is required to return the exact result
as specified in [low-precision.md](low-precision.md). With lower bit depths,
gemmlowp is no longer required to return an exact result.

## Implementation

Here we refer to the 3 stages of computation as described in
[design.md](design.md), namely: packing, computation kernel, unpacking.

The general idea is that at the packing stage, we requantize input (Lhs/Rhs)
data to less-than-8-bit depths by scaling them, thus shrinking the range of the
packed matrix entries; for instance, if the Rhs bit depth is to be 5 bits then
packed Rhs matrix entries will be in the range [0 ... 31]. This then allows the
GEMM kernel to use narrower accumulators without risking overflow, thus
achieving higher arithmetic throughput. Finally, at the unpacking stage, it only
remains to scale the result values to compensate for the scalings applied
earlier.

Let us go into more detail for each of those stages:

### Packing stage

The packing stage is where most of the work specific to the BitDepthParams takes
place.

Here, we have to scale input matrix values from their original range of [0 ...
255] to the range specified by the BitDepthParams, which is [0 ... (2^N)-1]
where N is the number of bits for the matrix at hand (Lhs or Rhs). For example,
for a bit depth of 5 bits, we need to scale down to [0 ... 31].

This scaling is what we call "requantization". The pedantic name matches the
fact that this is actually quite nontrivial to do correctly, i.e. in such a way
that the result accuracy will be good enough for real-world applications. See
the section below on requantization details.

Concretely, this work happens in PackingRegisterBlock::Pack(), which calls
Requantize(). This is in internal/pack.h. This code can be overridden for
specific architectures, see internal/pack_neon.h.

This requantization work is costly and makes packing slower. This means that, at
least in our approach, less-than-8-bit computation is only interesting for
large-enough, square-enough GEMMs where packing is only a small fraction of the
overall cost. In cases where packing overhead is more prevalent (highly
rectangular cases), less-than-8-bit is probably a waste of time as long as we
treat it as an internal computation detail. What might help there would be
shrinking the input/output data format itself, to lower memory bandwidth usage.

### Computation kernel stage

In principle, the computation kernel stage simply doesn't have to care about the
bit depth at all. In fact, on architectures where we do not have specific
optimized kernels for less-than-8-bit cases, we simply use our standard kernel
there, and that's correct!

However, while the kernel doesn't have to know about the fact that the operands
are on less than 8 bits, it can use that information to make special
optimizations that would be incorrect in the general 8-bit case and become
correct here thanks to the more restricted range of inputs. That's the whole
point of this less-than-8-bit computation idea.

With Lhs entries guaranteed to be smaller than 2^N, and Rhs entries guaranteed
to be smaller than 2^M, each product is guaranteed to be smaller than 2^(M+N).
Thus, one may accumulate 2^(16-(M+N)) such products and still be guaranteed that
such an accumulator will be smaller than 2^16, and thus can be stored as a
uint16 without risking overflow.

For example, in the L7R5 case, the Lhs entries are on 7 bits (N=7) and the Rhs
entries are on 5 bits (M=5), so each product fits in 12 bits and one can thus
accumulate 16 ( = 2^(16-12)) such products into uint16 accumulators with no risk
of overflow.

This means that a computation kernel may use uint16 accumulators for several
loop iterations (16 in the above example), provided that it is allowed to assume
that inputs are in such a restricted range.

After this fixed number of loop iterations, the kernel must accumulate the local
uint16 accumulators back into global uint32 accumulators.
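
As a scalar sketch of this two-level accumulation (the real kernels are SIMD
code, see below), and using hypothetical names rather than gemmlowp's, the L7R5
case could look like this:

    #include <algorithm>
    #include <cstdint>

    // Scalar illustration of two-level accumulation with 7-bit lhs and 5-bit
    // rhs entries: each product fits in 12 bits, so 16 products fit in a
    // uint16 accumulator, since 16 * 127 * 31 = 62992 < 65536.
    std::uint32_t DotProductL7R5(const std::uint8_t* lhs,
                                 const std::uint8_t* rhs, int depth) {
      std::uint32_t global_acc = 0;  // 32-bit global accumulator
      int d = 0;
      while (d < depth) {
        std::uint16_t local_acc = 0;  // 16-bit local accumulator
        const int chunk_end = std::min(d + 16, depth);
        for (; d < chunk_end; ++d) {
          local_acc += static_cast<std::uint16_t>(lhs[d]) * rhs[d];
        }
        // Flush into the wider global accumulator.
        global_acc += local_acc;
      }
      return global_acc;
    }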

On SIMD architectures with suitable uint16 arithmetic, this in principle allows
multiplying arithmetic throughput by up to 2x, since twice as many accumulators
now fit in each SIMD vector register. This is partially offset by the cost of
accumulating back into global uint32 accumulators every several loop iterations,
but our experience on ARM NEON has been that we still get quite close to a 2x
speedup. See internal/kernel_neon.h, specifically
NEON32Kernel12x4Depth2Assuming12BitProducts.

### Unpacking stage

At the unpacking stage, it only remains to scale the result values to compensate
for the scaling of the inputs. This is easier because now we are expanding the
range instead of shrinking it, so we don't need to worry about ways to minimize
a loss of accuracy. We simply need to multiply result values by a constant
fraction, rounding to nearest.

Since the inputs were scaled by factors of (2^lhs_bits - 1)/255 and
(2^rhs_bits - 1)/255 respectively, the scaling of the outputs needs to be by the
following factor:

                 255 * 255
    -----------------------------------
    (2^lhs_bits - 1) * (2^rhs_bits - 1)

This is done by a MultiplyByConstantFraction function, see internal/unpack.h.
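
For instance, in the L7R5 case this factor is 255*255 / (127*31), i.e. about
16.5. As a hedged sketch (not the actual MultiplyByConstantFraction code in
internal/unpack.h), multiplying by such a constant fraction with rounding to
nearest could look like:

    #include <cstdint>

    // Multiply x by Numerator/Denominator, rounding to nearest.
    // For the L7R5 case: Numerator = 255 * 255, Denominator = 127 * 31.
    template <std::int64_t Numerator, std::int64_t Denominator>
    std::int32_t MultiplyByFractionRoundingToNearest(std::int32_t x) {
      const std::int64_t scaled = static_cast<std::int64_t>(x) * Numerator;
      // Adding half the denominator (with the sign of the value) before the
      // division turns truncating division into round-to-nearest.
      const std::int64_t half = (scaled >= 0 ? Denominator : -Denominator) / 2;
      return static_cast<std::int32_t>((scaled + half) / Denominator);
    }

    // Example, for the L7R5 case:
    //   MultiplyByFractionRoundingToNearest<255 * 255, 127 * 31>(accumulator);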

## Requantization details

Here we go into more detail on the Requantize() function used at the packing
stage to requantize input matrix data. See this function in internal/pack.h.

It depends on the bit depth and on a rounding mode, and requantizes an input
value in [0 ... 255] to the range [0 ... (2^N)-1] specified by the bit depth N.

### Naive, bad rounding, that's plainly biased

Naive and inaccurate ways to achieve this requantization include:

1.  Shifting bits right by (8-N) bits;
2.  Multiplying by ((2^N) - 1) and dividing by 255.

Both of those are biased in some way: 1. has the wrong "derivative", since it
approximates (((2^N) - 1) / 255) by ((2^N) / 256); 2. has bias since it
effectively implements rounding towards 0.
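
Expressed in code, these two naive requantizations would look roughly like this
(an illustrative sketch, not gemmlowp's code):

    #include <cstdint>

    // Naive requantization from [0..255] to [0..(2^N)-1] by bit-shifting:
    // wrong slope, as it approximates ((2^N)-1)/255 by (2^N)/256.
    template <int N>
    std::uint8_t RequantizeByShift(std::uint8_t src) {
      return src >> (8 - N);
    }

    // Naive requantization by multiply-and-divide: right slope, but the
    // integer division truncates, i.e. rounds towards 0, which is biased.
    template <int N>
    std::uint8_t RequantizeByTruncatingDivision(std::uint8_t src) {
      const std::uint32_t maxval = (1u << N) - 1;
      return static_cast<std::uint8_t>(src * maxval / 255);
    }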

In practice, both of the above requantization functions give results that are
too inaccurate for the neural network that we tried (GoogLeNet).

### Round-to-nearest rounding: unbiased in principle but not in practice

The simplest fix is to avoid the bias in 2. by rounding-to-nearest instead of
rounding towards 0. This can be achieved by doing

    dst = (src * maxval + rounding_offset) / 255;

where maxval = ((2^N) - 1) is the highest requantized value, and the
rounding_offset can be set to

    rounding_offset = 127

to achieve rounding-to-nearest (while the above rounding towards 0 corresponded
to rounding_offset = 0).
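
A sketch of this round-to-nearest requantization (not the exact Requantize() in
internal/pack.h) follows; for N=5, maxval=31, it maps src=128 to
(128*31 + 127)/255 = 16, whereas rounding towards 0 would give 15:

    #include <cstdint>

    // Requantize src in [0..255] to [0..(2^N)-1], rounding to nearest.
    template <int N>
    std::uint8_t RequantizeRoundToNearest(std::uint8_t src) {
      const std::uint32_t maxval = (1u << N) - 1;
      return static_cast<std::uint8_t>((src * maxval + 127) / 255);
    }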

In principle, rounding-to-nearest is unbiased and optimal in various ways.

In practice though, our input data is not random real numbers, but
already-quantized 8-bit values. That means that even in the best case, there
would be at most 255 different possible input values; in practice, we generally
see the input values distributed non-uniformly in that range, so that a majority
of input values tend to be in a much smaller range. See test/test_data.cc.

Having a large part of the input values in a very small finite set means that
the corresponding rounding errors are also in a very small finite set, which can
be small enough that the mean of these rounding errors is significantly
different from 0. That rounding-to-nearest is "unbiased" only means that over a
sufficiently large set of input values, the bias would become arbitrarily close
to 0; here, the set of input values is effectively small enough that the
resulting bias is significant.

This leads to biasing the matrix product entries, resulting in an error that
grows linearly with the depth dimension of the GEMM.

### Probabilistic rounding: unbiased even on small finite input distributions

To address that, we can instead use probabilistic rounding. The idea is that,
for instance, if we have to round the value 3.8 to the nearest integer, we can
round it to 3 with 20% probability and to 4 with 80% probability. If that value
3.8 occurs many times, the mean requantized value will thus tend to 3.8.

This amounts to keeping the above requantization formula,

    dst = (src * maxval + rounding_offset) / 255;

but now the rounding_offset is a random value in [0 .. 254].

This guarantees zero bias no matter how small the distribution of input values
is.
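
As a standalone sketch (the actual code threads a rounding-offset generator
through Requantize() in internal/pack.h, so this is only illustrative):

    #include <cstdint>
    #include <random>

    // Requantize src in [0..255] to [0..(2^N)-1] with probabilistic rounding:
    // the offset is uniform over [0..254], so the expected value of dst is
    // exactly src * maxval / 255.
    template <int N, typename RandomEngine>
    std::uint8_t RequantizeProbabilistic(std::uint8_t src,
                                         RandomEngine& engine) {
      const std::uint32_t maxval = (1u << N) - 1;
      std::uniform_int_distribution<std::uint32_t> offset_dist(0, 254);
      const std::uint32_t rounding_offset = offset_dist(engine);
      return static_cast<std::uint8_t>((src * maxval + rounding_offset) / 255);
    }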

On the other hand, the variance of the error term here is higher than with
rounding-to-nearest --- one can check that it is 2x higher.

So the error term coming from the Central Limit Theorem, which grows with the
square root of the accumulator depth, i.e. the GEMM depth, will have 2x higher
variance here, i.e. be larger by a factor of sqrt(2).

Still, for large enough GEMM depth, that is better than rounding-to-nearest,
which has an error term growing linearly with the GEMM depth.

### Switching between rounding-to-nearest and probabilistic rounding

Thus, for fixed input values and bit depths, we expect that probabilistic
rounding will give more accurate results for large enough GEMM depths, while
rounding-to-nearest will be more accurate for smaller GEMM depths.

That is why we switch between these rounding modes based on GEMM depth; see
ChooseRoundingMode in internal/bit_depth_util.h.

It is based on a constant, kProbabilisticRoundingThreshold, defined in
internal/common.h and empirically determined. See the comment there. It would be
nice to better understand the statistics here and come up with better heuristics
for this switching.
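
The heuristic is along these lines (a sketch only; the real logic is
ChooseRoundingMode in internal/bit_depth_util.h, and the threshold value below
is a placeholder for kProbabilisticRoundingThreshold in internal/common.h):

    enum class RoundingMode { RoundToNearest, Probabilistic };

    inline RoundingMode ChooseRoundingModeSketch(int accumulation_depth) {
      // Placeholder threshold; the real constant is
      // kProbabilisticRoundingThreshold in internal/common.h.
      const int kThreshold = 600;
      return accumulation_depth < kThreshold ? RoundingMode::RoundToNearest
                                             : RoundingMode::Probabilistic;
    }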

### Choice of pseudorandom number generator

We provide two PRNGs. The first is an 8-bit Xorshift. It is fast, naturally
produces values ranging over an interval of width 255, which is what we need
here (as opposed to an interval of width 256), and turns out, from empirical
tests, to produce better results than a linear congruential generator (LCG).
That's unfortunate, as an 8-bit LCG would be cheaper to compute (we confirmed
that on various ARM devices), but we need as close to perfect unbiasedness as we
can get.

The second is an "add-mod" sequence generator, which generates non-random values
in the sequence x = (x + 97) % 255. This generates a low-discrepancy sequence
that minimizes the "clumpiness" of the random offsets (thus, for example,
quantizing a 3x3 matrix will have a maximum additive error of about 200 from the
random offsets). While not random, this sequence performs well empirically for
many quantizations. (For information about why 97 is a good value, see
https://en.wikipedia.org/wiki/Low-discrepancy_sequence#Additive_recurrence and
http://mollwollfumble.blogspot.com/2011/03/subrandom-numbers.html. 97/255 = 0.38,
close to the optimal value of 0.382. For discrete numbers, the choice must be
relatively prime to the modulus; 97 is prime, so it is safely relatively prime
to 255. 107 is another near-optimal choice.)

The low-discrepancy sequence generator is the default.
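
A minimal sketch of this add-mod generator (the real offset generators,
including the 8-bit Xorshift, live in internal/pack.h; the class name and the
starting value here are illustrative):

    #include <cstdint>

    // Low-discrepancy "add-mod" rounding-offset generator: repeatedly adds 97
    // modulo 255, producing offsets in [0..254] that are well spread out
    // rather than clumped.
    class AddModSequenceGenerator {
     public:
      std::uint8_t get() {
        x_ = (x_ + 97) % 255;
        return x_;
      }

     private:
      std::uint8_t x_ = 128;  // arbitrary starting point in [0..254]
    };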

More details and results are given in a comment on the default PRNG in
internal/pack.h. Interested users can change the PRNG used by setting
DefaultRoundingGenerator in bit_depth_util.h.