| This document contains notes about various ideas that for one reason or another |
| are not being actively pursued. |
| |
| ## Next byte is non-ASCII after ASCII optimization |
| |
| The current plan for a SIMD-accelerated inner loop for handling ASCII bytes |
| makes no use of the bit of information that if the buffers didn't end but the |
| ASCII loop exited, the next byte will not be an ASCII byte. |
| |
| ## Handling ASCII with table lookups when decoding single-byte to UTF-16 |
| |
| Both uconv and ICU outperform encoding_rs when decoding single-byte to UTF-16. |
| unconv doesn't even do anything fancy to manually unroll the loop (see below). |
| Both handle even the ASCII range using table lookup. That is, there's no branch |
| for checking if we're in the lower or upper half of the encoding. |
| |
| However, adding SIMD acceleration for the ASCII half will likely be a bigger |
| win than eliminating the branch to decide ASCII vs. non-ASCII. |
| |
| ## Manual loop unrolling for single-byte encodings |
| |
| ICU currently outperforms encoding_rs (by over x2!) when decoding a single-byte |
| encoding to UTF-16. This appears to be thanks to manually unrolling the |
| conversion loop by 16. See [ucnv_MBCSSingleToBMPWithOffsets][1]. |
| |
| [1]: https://ssl.icu-project.org/repos/icu/icu/tags/release-55-1/source/common/ucnvmbcs.cpp |
| |
| Notably, none of the single-byte encodings have bytes that'd decode to the |
| upper half of BMP. Therefore, if the unmappable marker has the highest bit set |
| instead of being zero, the check for unmappables within a 16-character stride |
| can be done either by ORing the BMP characters in the stride together and |
| checking the high bit or by loading the upper halves of the BMP charaters |
| in a `u8x8` register and checking the high bits using the `_mm_movemask_epi8` |
| / `pmovmskb` SSE2 instruction. |
| |
| ## After non-ASCII, handle ASCII punctuation without SIMD |
| |
| Since the failure mode of SIMD ASCII acceleration involves wasted aligment |
| checks and a wasted SIMD read when the next code unit is non-ASCII and non-Latin |
| scripts have runs of non-ASCII even if ASCII spaces and punctuation is used, |
| consider handling the next two or three bytes following non-ASCII as non-SIMD |
| before looping back to the SIMD mode. Maybe move back to SIMD ASCII faster if |
| there's ASCII that's not space or punctuation. Maybe with the "space or |
| punctuation" check in place, this code can be allowed to be in place even for |
| UTF-8 and Latin single-byte (i.e. not having different code for Latin and |
| non-Latin single-byte). |
| |
| ## Prefer maintaining aligment |
| |
| Instead of returning to acceleration directly after non-ASCII, consider |
| continuing to the alignment boundary without acceleration. |
| |
| ## Read from SIMD lanes instead of RAM (cache) when ASCII check fails |
| |
| When the SIMD ASCII check fails, the data has already been read from memory. |
| Test whether it's faster to read the data by lane from the SIMD register than |
| to read it again from RAM (cache). |
| |
| ## Use Level 2 Hanzi and Level 2 Kanji ordering |
| |
| These two are ordered by radical and then by stroke count, so in principle, |
| they should be mostly Unicode-ordered, although at least Level 2 Hanzi isn't |
| fully Unicode-ordered. Is "mostly" good enough for encode accelelation? |
| |
| ## Create a `divmod_94()` function |
| |
| Experiment with a function that computes `(i / 94, i % 94)` more efficiently |
| than generic code. |
| |
| ## Align writes on Aarch64 |
| |
| On [Cortex-A57](https://stackoverflow.com/questions/45714535/performance-of-unaligned-simd-load-store-on-aarch64/45938112#45938112 |
| ), it might be a good idea to move the destination into 16-byte alignment. |
| |
| ## Unalign UTF-8 validation on Aarch64 |
| |
| Currently, Aarch64 runs the generic ALU UTF-8 validation code that aligns |
| reads. That's probably unnecessary on Aarch64. (SIMD was slower than ALU!) |
| |
| ## Table-driven UTF-8 validation |
| |
| When there are at least four bytes left, read all four. With each byte |
| index into tables corresponding to magic values indexable by byte in |
| each position. |
| |
| In the value read from the table indexed by lead byte, encode the |
| following in 16 bits: advance 2 bits (2, 3 or 4 bytes), 9 positional |
| bits one of which is set to indicate the type of lead byte (8 valid |
| types, in the 8 lowest bits, and invalid, ASCII would be tenth type), |
| and the mask for extracting the payload bits from the lead byte |
| (for conversion to UTF-16 or UTF-32). |
| |
| In the tables indexable by the trail bytes, in each positions |
| corresponding byte the lead byte type, store 1 if the trail is |
| invalid given the lead and 0 if valid given the lead. |
| |
| Use the low 8 bits of the of the 16 bits read from the first |
| table to mask (bitwise AND) one positional bit from each of the |
| three other values. Bitwise OR the results together with the |
| bit that is 1 if the lead is invalid. If the result is zero, |
| the sequence is valid. Otherwise it's invalid. |
| |
| Use the advance to advance. In the conversion to UTF-16 or |
| UTF-32 case, use the mast for extracting the meaningful |
| bits from the lead byte to mask them from the lead. Shift |
| left by 6 as many times as the advance indicates, etc. |