Quantization Accuracy Debugging
-------------------------------

This document provides high-level strategies for improving quantization
accuracy. If a quantized model has error compared to the original model,
we can categorize the error into:

1. **data insensitive error** - caused by intrinsic model quantization error,
   where a large portion of the input data has large error
2. **data sensitive error** - caused by outlier input data, where a small
   portion of the input data has large error
3. **implementation error** - the quantized kernel does not match the reference implementation
Data insensitive error
~~~~~~~~~~~~~~~~~~~~~~

General tips
^^^^^^^^^^^^

1. For PTQ, ensure that the data you are calibrating with is representative
   of your dataset. For example, for a classification problem a general
   guideline is to have multiple samples in every category, and the overall
   number of samples should be at least 100. There is no penalty for
   calibrating with more data other than calibration time.
2. If your model has Conv-BN or Linear-BN patterns, consider fusing them.
   If you are using FX graph mode quantization, this is done automatically
   by the workflow. If you are using Eager mode quantization, you can do
   this manually with the ``torch.ao.quantization.fuse_modules`` API
   (see the fusion sketch after this list).
3. Increase the dtype precision of the problematic ops. Usually, fp32
   will have the highest accuracy, followed by fp16, followed by dynamically
   quantized int8, followed by statically quantized int8 (see the precision
   override sketch after this list).

   1. Note: this is trading off performance for accuracy.
   2. Note: the availability of kernels per dtype per op can vary by backend.
   3. Note: dtype conversions add an additional performance cost. For example,
      ``fp32_op -> quant -> int8_op -> dequant -> fp32_op -> quant -> int8_op -> dequant``
      will have a performance penalty compared to
      ``fp32_op -> fp32_op -> quant -> int8_op -> int8_op -> dequant``
      because of the higher number of required dtype conversions.

4. If you are using PTQ, consider using QAT to recover some of the accuracy
   loss from quantization (see the QAT sketch after this list).
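
For Eager mode quantization, a minimal fusion sketch might look like the
following; the ``ConvBNReLU`` module is a toy model defined here only for
illustration::

    import torch
    from torch.ao.quantization import fuse_modules

    class ConvBNReLU(torch.nn.Module):
        """Toy model with a Conv-BN-ReLU pattern (illustrative only)."""

        def __init__(self):
            super().__init__()
            self.conv = torch.nn.Conv2d(3, 16, 3)
            self.bn = torch.nn.BatchNorm2d(16)
            self.relu = torch.nn.ReLU()

        def forward(self, x):
            return self.relu(self.bn(self.conv(x)))

    model = ConvBNReLU().eval()
    # fuse Conv-BN-ReLU into a single module before quantization;
    # the names must match the attribute names inside the model
    fused_model = fuse_modules(model, [["conv", "bn", "relu"]])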
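
For FX graph mode quantization, one way to keep a problematic op in higher
precision is to override its qconfig by module name. Below is a minimal
precision override sketch; ``SmallModel`` and the module name
``problematic_layer`` are illustrative assumptions::

    import torch
    from torch.ao.quantization import get_default_qconfig_mapping
    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

    class SmallModel(torch.nn.Module):
        """Toy model used only for this sketch."""

        def __init__(self):
            super().__init__()
            self.conv = torch.nn.Conv2d(3, 8, 3)
            self.problematic_layer = torch.nn.Linear(8, 4)

        def forward(self, x):
            x = self.conv(x)
            x = x.mean(dim=(2, 3))
            return self.problematic_layer(x)

    model_fp32 = SmallModel().eval()
    qconfig_mapping = get_default_qconfig_mapping("fbgemm")
    # keep the problematic module in fp32 by giving it a qconfig of None
    qconfig_mapping.set_module_name("problematic_layer", None)

    example_inputs = (torch.randn(1, 3, 32, 32),)
    prepared = prepare_fx(model_fp32, qconfig_mapping, example_inputs)
    # ... run representative calibration data through `prepared` here ...
    quantized = convert_fx(prepared)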
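
If PTQ accuracy is not sufficient, an FX graph mode QAT flow looks roughly
like the sketch below; the toy model and the elided training loop are
placeholders::

    import torch
    from torch.ao.quantization import get_default_qat_qconfig_mapping
    from torch.ao.quantization.quantize_fx import prepare_qat_fx, convert_fx

    model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).train()
    qconfig_mapping = get_default_qat_qconfig_mapping("fbgemm")
    example_inputs = (torch.randn(1, 3, 32, 32),)

    model_prepared = prepare_qat_fx(model, qconfig_mapping, example_inputs)

    # ... fine-tune model_prepared with the usual training loop; fake-quant
    # modules simulate int8 rounding so the weights adapt to quantization ...

    model_quantized = convert_fx(model_prepared.eval())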

Int8 quantization tips
^^^^^^^^^^^^^^^^^^^^^^

1. If you are using per-tensor weight quantization, consider using per-channel
   weight quantization (see the qconfig sketch after this list).
2. If you are doing inference on ``fbgemm``, ensure that you set the
   ``reduce_range`` argument to ``False`` if your CPU is Cooper Lake or newer,
   and to ``True`` otherwise.
3. Audit the input activation distribution variation across different samples.
   If this variation is high, the layer may be suitable for dynamic quantization
   but not static quantization (see the activation audit sketch after this list).
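
A custom Eager mode qconfig that combines per-channel weight quantization
with an explicit ``reduce_range`` setting might look like the sketch below;
the specific observer choices are illustrative, not the only valid ones::

    import torch
    from torch.ao.quantization import QConfig
    from torch.ao.quantization.observer import (
        MinMaxObserver,
        PerChannelMinMaxObserver,
    )

    # per-channel symmetric weights, per-tensor affine activations;
    # reduce_range=True restricts activations to 7 bits, which avoids
    # potential overflow in fbgemm on CPUs older than Cooper Lake
    custom_qconfig = QConfig(
        activation=MinMaxObserver.with_args(
            dtype=torch.quint8, reduce_range=True
        ),
        weight=PerChannelMinMaxObserver.with_args(
            dtype=torch.qint8, qscheme=torch.per_channel_symmetric
        ),
    )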
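
One way to audit activation variation is to record per-sample activation
ranges with a forward hook; the helper below and the module name passed to
it are illustrative assumptions::

    import torch

    def collect_activation_ranges(model, module_name, data_loader):
        """Record (min, max) of one module's input activation per sample."""
        ranges = []
        module = dict(model.named_modules())[module_name]

        def hook(_module, inputs, _output):
            x = inputs[0].detach()
            ranges.append((x.min().item(), x.max().item()))

        handle = module.register_forward_hook(hook)
        with torch.no_grad():
            for sample in data_loader:
                model(sample)
        handle.remove()
        return ranges

    # large variation of these ranges across samples suggests that a single
    # static scale/zero_point may fit some samples poorly:
    # ranges = collect_activation_ranges(model_fp32, "layer_of_interest", calib_data)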

Data sensitive error
~~~~~~~~~~~~~~~~~~~~

If you are using static quantization and a small portion of your input data is
resulting in high quantization error, you can try the following:

1. Adjust your calibration dataset to make it more representative of your
   inference dataset.
2. Manually inspect (using Numeric Suite) which layers have high quantization
   error. For these layers, consider leaving them in floating point or adjusting
   the observer settings to choose a better scale and zero_point (see the
   Numeric Suite sketch after this list).
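
A minimal sketch of a layer-by-layer weight comparison with the Eager mode
Numeric Suite prototype is shown below; the SQNR helper is defined here for
illustration, and ``float_model`` / ``quantized_model`` are placeholders for
your own models::

    import torch
    import torch.ao.ns._numeric_suite as ns

    def compute_sqnr(x, y):
        # signal-to-quantization-noise ratio in dB; higher is better
        return (20 * torch.log10(torch.norm(x) / torch.norm(x - y))).item()

    # float_model and quantized_model are placeholders for your models
    wt_compare_dict = ns.compare_weights(
        float_model.state_dict(), quantized_model.state_dict()
    )
    for key in wt_compare_dict:
        print(
            key,
            compute_sqnr(
                wt_compare_dict[key]["float"],
                wt_compare_dict[key]["quantized"].dequantize(),
            ),
        )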


Implementation error
~~~~~~~~~~~~~~~~~~~~

If you are using PyTorch quantization with your own backend,
you may see differences between the reference implementation of an
operation (such as ``dequant -> op_fp32 -> quant``) and the quantized
implementation (such as ``op_int8``) of the op on the target hardware
(a minimal comparison sketch follows the list below). This could mean
one of two things:

1. The differences (usually small) are expected due to specific behavior of
   the target kernel on the target hardware compared to fp32/CPU. An example
   of this is accumulating in an integer dtype. Unless the kernel guarantees
   bitwise equivalence with the reference implementation, this is expected.
2. The kernel on the target hardware has an accuracy issue. In this case, reach
   out to the kernel developer.
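
As a rough way to quantify such differences, one can run a quantized op and
its ``dequant -> op_fp32 -> quant`` reference on the same input and compare
the outputs. The sketch below assumes the
``torch.ao.nn.quantized.functional.linear`` op is available on your backend;
the scales and zero_points are arbitrary illustrative values::

    import torch
    import torch.ao.nn.quantized.functional as qF

    x_fp32 = torch.randn(4, 8)
    w_fp32 = torch.randn(6, 8)

    x_q = torch.quantize_per_tensor(x_fp32, scale=0.05, zero_point=128, dtype=torch.quint8)
    w_q = torch.quantize_per_tensor(w_fp32, scale=0.05, zero_point=0, dtype=torch.qint8)
    out_scale, out_zero_point = 0.1, 128

    # quantized kernel under test (int8 matmul with integer accumulation)
    y_kernel = qF.linear(x_q, w_q, scale=out_scale, zero_point=out_zero_point)

    # reference path: dequant -> fp32 op -> quant
    y_ref = torch.quantize_per_tensor(
        torch.nn.functional.linear(x_q.dequantize(), w_q.dequantize()),
        out_scale, out_zero_point, torch.quint8,
    )

    # differences on the order of the output scale can be expected;
    # much larger differences suggest an issue in the quantized kernel
    max_diff = (y_kernel.dequantize() - y_ref.dequantize()).abs().max()
    print(f"max abs difference vs reference: {max_diff.item():.6f}")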

Numerical Debugging Tooling (prototype)
---------------------------------------

.. toctree::
   :hidden:

   torch.ao.ns._numeric_suite
   torch.ao.ns._numeric_suite_fx

.. warning::
   Numerical debugging tooling is an early prototype and subject to change.

* :ref:`torch_ao_ns_numeric_suite`
  Eager mode numeric suite
* :ref:`torch_ao_ns_numeric_suite_fx`
  FX numeric suite