| METAPROGRAMMED GEMM |
| =================== |
| |
| The two main goals of this library are: |
| - providing a new matrix multiplication kernel. |
| - providing the optimized codepaths for as many possible user scenarios without |
| enforcing additional input data constraints (padding, sizes, strides, layout) |
| |
| To enable this code add -DGEMMLOWP_USE_META_FASTPATH to your build setup. |
| |
| The new kernel |
| -------------- |
| |
| The multiplication kernel - the most inner loop of the matrix multiplication, |
| which is responsible for the row/column products was rewritten. The new code |
| produces a 3x3 result patch and processes the row/column arrays in 8 element |
| packs (the kernel 'shape' is 3x3x8 compared to the previous 12x4x2). By using |
| specialized 8bit multiplication, aggregating to vector aggregators and then |
| reduction with parallel horizontal addition we devised code that achieved |
| higher arithmetical density (arithmetical operation per assembly instruction). |
| The arithmetical performance of the new kernel exceeds 18 GOps/s on a vanilla |
| Nexus 5 phone (which is practically peak for this device). |
| |
| In order to feed the kernel with input data and minimize the number of |
| instructions other than the arithmetical operations a different packing |
| scheme was used. Three rows (columns) are interweaved every 8 elements so that |
| they can be read from continuous memory in one op inside the kernel. Additional |
| memory preload hint operations are inserted into the kernel to hide memory |
| latency behind arithmetical operations. |
| |
| Generated code |
| -------------- |
| |
| The basic kernel used in this approach is of shape 3x3x8. Obviously this |
| kernel can be easily applied to multipications where matrix sizes are: |
| M x K, K x N where M and N are multiplies of 3 and K is a multiply of 8. |
| |
| We rejected two obvious solutions of: padding the matrix sizes to appropriate |
| sizes, or using the reference implementation for the leftovers. Neither did |
| we consider enforcing extra constraints on the caller. |
| |
| In order to allow all matrix sizes the kernels processing all combinations of |
| 1, 2 or 3 rows and 1, 2 or 3 columns are required. Similarily to allow all |
| possible depths the leftover values (up to 7 elements) needed to be handled. |
| |
| Instead of writing those kernels by hand we decided to generate them with |
| some python scripts. 9 Versions of the multiplication kernel were prepared. |
| Additionally packing and unpacking code for different row/column counts and |
| depth leftovers was generated. Finally different code was generated for |
| aligned memory reads/writes and unaligned memory reads/writes. |
| |
| Using those multiplication and packing/unpacking primitives 144 gemm function |
| versions were prepared. On top of them one high level gemm function that would |
| switch to one of those preoptimized versions. |
| |
| This approach allowed moving all unnecessary branching and conditional execution |
| outside of the inner loops. It also allowed removing of all short loops required |
| for leftover handling. Finally aligned memory reads/writes are used everywhere |
| where the provided input data allows. |
| |
| Results |
| ------- |
| |
| The library shows up to 35% faster gemm execution in some cases (e.g. ImageNet |
| benchmark). |
| |
| Files |
| ----- |
| |
| single_thread_gemm.h |
| -- generated ARM/NEON 8bit x 8bit gemm implementation. Contains all the |
| optimized, unrolled and curried pack/unpack, and multiply procedures and |
| a single gemm function that switches between the optimized versions based |
| on the runtime parameters. |
| |
| multi_thread_gemm.h |
| -- a simple parallelization scheme for the gemm function. |
| |
| generators/gemm_NxMxK_neon.py |
| -- script that generates the single_thread_gemm.h header library. |
| Usage: python gemm_NxMxK_neon > single_thread_gemm.h |